bug#15260: cannot build in a directory with non-ascii characters

unofficial mirror of bug-gnu-emacs@gnu.org 

* bug#15260: cannot build in a directory with non-ascii characters
@ 2013-09-03 17:46 Glenn Morris
  2013-10-23 20:48 ` Glenn Morris
  0 siblings, 1 reply; 50+ messages in thread
From: Glenn Morris @ 2013-09-03 17:46 UTC (permalink / raw)
  To: 15260

Package: emacs
Severity: important
Version: 24.3

It seems Emacs (still) cannot be built in a directory whose name contains
non-ascii characters. Ref:

http://lists.gnu.org/archive/html/help-gnu-emacs/2013-09/msg00033.html

If it cannot be made to work, configure should abort with an error in
such cases.

I have some vague memory that it also might not work with spaces in the
names, but did not test.

Similar restrictions may apply to the install --prefix as well.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* bug#15260: cannot build in a directory with non-ascii characters
  2013-09-03 17:46 bug#15260: cannot build in a directory with non-ascii characters Glenn Morris
@ 2013-10-23 20:48 ` Glenn Morris
  2013-10-24 18:25   ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Glenn Morris @ 2013-10-23 20:48 UTC (permalink / raw)
  To: 15260

Glenn Morris wrote:

> If it cannot be made to work, configure should abort with an error in
> such cases. [non-ascii directories]

Done. Leaving this open as a wishlist to make it work.

> I have some vague memory that it also might not work with spaces in the
> names, but did not test.

This works now - http://debbugs.gnu.org/15675 .






* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-23 20:48 ` Glenn Morris
@ 2013-10-24 18:25   ` Eli Zaretskii
  2013-10-24 18:35     ` Glenn Morris
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-24 18:25 UTC (permalink / raw)
  To: Glenn Morris; +Cc: 15260

> From: Glenn Morris <rgm@gnu.org>
> Date: Wed, 23 Oct 2013 16:48:42 -0400
> 
> Glenn Morris wrote:
> 
> > If it cannot be made to work, configure should abort with an error in
> > such cases. [non-ascii directories]
> 
> Done. Leaving this open as a wishlist to make it work.

  dnl configure sets LC_ALL=C early on, so this range should work.
  case "$var" in
    *[[^\ -~]]*) AC_MSG_ERROR([Emacs cannot be built or installed in a directory whose name contains non-ASCII characters: $var]) ;;
  esac

This is quite drastic.  Do we understand what is the underlying
technical reason for the build failures?  The bug reports didn't give
any explanations, only the fact that moving to a pure-ASCII directory
fixed the problem.
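
For illustration only (this is not part of configure.ac): under LC_ALL=C
the bracket range from space to tilde covers the printable ASCII bytes
0x20 through 0x7E, so the pattern quoted above fires on any name that
contains a byte outside that range.  A rough Python equivalent, using the
hypothetical helper name has_byte_outside_printable_ascii, might look
like this:

```python
def has_byte_outside_printable_ascii(path: bytes) -> bool:
    # Mirrors the idea of the shell pattern *[^ -~]* under LC_ALL=C:
    # report any byte that is below space (0x20) or above tilde (0x7E).
    return any(b < 0x20 or b > 0x7E for b in path)


# A plain ASCII build directory is accepted.
print(has_byte_outside_printable_ascii(b"/tmp/emacs-build"))
# A UTF-8 encoded e-acute (bytes 0xC3 0xA9) is rejected.
print(has_byte_outside_printable_ascii("/tmp/caf\u00e9/emacs".encode("utf-8")))
```

A name containing a space passes this particular test, since space is the
lower end of the range.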






* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-24 18:25   ` Eli Zaretskii
@ 2013-10-24 18:35     ` Glenn Morris
  2013-10-25 14:25       ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Glenn Morris @ 2013-10-24 18:35 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 15260

Eli Zaretskii wrote:

>   case "$var" in
>     *[[^\ -~]]*) AC_MSG_ERROR([Emacs cannot be built or installed in a directory whose name contains non-ASCII characters: $var]) ;;
>   esac
>
> This is quite drastic. 

I don't think so. The alternative is a cryptic failure during the build stage.

> Do we understand what is the underlying technical reason for the
> build failures? 

Something to do with failure to find files, just as it was 6 years ago.
http://lists.gnu.org/archive/html/emacs-devel/2007-05/msg00984.html

The immediate problem for me is a dump failure:

    Finding pointers to doc strings...
    Finding pointers to doc strings...done
    Dumping under the name emacs
    emacs: Can't open /path/to/non-ascii/src/temacs for reading: No such file
    or directory
    make[1]: *** [bootstrap-emacs] Error 1

Why not make a non-ASCII directory and try it yourself...






* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-24 18:35     ` Glenn Morris
@ 2013-10-25 14:25       ` Eli Zaretskii
  2013-10-25 17:08         ` Glenn Morris
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-25 14:25 UTC (permalink / raw)
  To: Glenn Morris; +Cc: 15260

> From: Glenn Morris <rgm@gnu.org>
> Cc: 15260@debbugs.gnu.org
> Date: Thu, 24 Oct 2013 14:35:15 -0400
> 
> Eli Zaretskii wrote:
> 
> >   case "$var" in
> >     *[[^\ -~]]*) AC_MSG_ERROR([Emacs cannot be built or installed in a directory whose name contains non-ASCII characters: $var]) ;;
> >   esac
> >
> > This is quite drastic. 
> 
> I don't think so. The alternative is a cryptic failure during the build stage.
> 
> > Do we understand what is the underlying technical reason for the
> > build failures? 
> 
> Something to do with failure to find files, just as it was 6 years ago.
> http://lists.gnu.org/archive/html/emacs-devel/2007-05/msg00984.html
> 
> The immediate problem for me is a dump failure:
> 
>     Finding pointers to doc strings...
>     Finding pointers to doc strings...done
>     Dumping under the name emacs
>     emacs: Can't open /path/to/non-ascii/src/temacs for reading: No such file
>     or directory
>     make[1]: *** [bootstrap-emacs] Error 1

Does the change below help?

> Why not make a non-ASCII directory and try it yourself...

It requires too much setup on my part (this cannot be simulated on
Windows without too much hassle).  But I will do that if there's no
easier way.  I just thought that some analysis has been done already.

=== modified file 'src/emacs.c'
--- src/emacs.c	2013-10-20 16:47:42 +0000
+++ src/emacs.c	2013-10-25 14:21:47 +0000
@@ -2044,11 +2044,15 @@ You must run Emacs in batch mode in orde
 
   CHECK_STRING (filename);
   filename = Fexpand_file_name (filename, Qnil);
+  filename = ENCODE_FILE (filename);
   if (!NILP (symfile))
     {
       CHECK_STRING (symfile);
       if (SCHARS (symfile))
-	symfile = Fexpand_file_name (symfile, Qnil);
+	{
+	  symfile = Fexpand_file_name (symfile, Qnil);
+	  symfile = ENCODE_FILE (symfile);
+	}
     }
 
   tem = Vpurify_flag;
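
The patch above encodes the file names before the dump code hands them to
the operating system.  The general idea, sketched here in Python rather
than C and not taken from the Emacs sources, is that the bytes passed to
open must match the bytes stored on disk.  The helper can_open and the
file name are made up for the example, and the sketch assumes a file
system that stores names as raw bytes (as on GNU/Linux):

```python
import os
import tempfile


def can_open(directory: bytes, name: str, encoding: str) -> bool:
    # Encode the textual name with the given encoding, then try to open
    # the resulting byte path.  Any OS-level failure counts as "no".
    path = os.path.join(directory, name.encode(encoding))
    try:
        with open(path, "rb"):
            return True
    except OSError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    tmp_bytes = os.fsencode(tmp)
    name = "temacs-\u00e9"
    # Create the file under its UTF-8 encoded name.
    with open(os.path.join(tmp_bytes, name.encode("utf-8")), "wb") as f:
        f.write(b"")
    # Same bytes as on disk.
    print(can_open(tmp_bytes, name, "utf-8"))
    # A different byte sequence for the same text.
    print(can_open(tmp_bytes, name, "latin-1"))
```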







* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-25 14:25       ` Eli Zaretskii
@ 2013-10-25 17:08         ` Glenn Morris
  2013-10-25 18:31           ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Glenn Morris @ 2013-10-25 17:08 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 15260

Eli Zaretskii wrote:

> It requires too much setup on my part (this cannot be simulated on
> Windows without too much hassle).

Sorry, I assumed you could build on fencepost. I'm expecting multiple
points of failure, so this might not be an efficient process...

The first time, I was trying an out-of-tree build in a non-ascii build
directory, but still with ascii srcdir. Using an in-place build in a
non-ascii directory fails to even start temacs (this is with or without
your patch):

    Warning: arch-independent data dir
    `/tmp/EMACS/share/emacs/24.3.50/etc/': No such file or directory
    Error: charsets directory not found:
    /tmp/EMACS/share/emacs/24.3.50/etc/charsets
    Emacs will not function correctly without the character map files.
    Please check your installation!
    make[1]: *** [bootstrap-emacs] Error 1

/tmp/EMACS was my install --prefix. It's not supposed to exist until
after installation, but the code that tries to find etc/ is presumably
mistakenly concluding that srcdir/etc does not exist and that it must be
running installed.


* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-25 17:08         ` Glenn Morris
@ 2013-10-25 18:31           ` Eli Zaretskii
  2013-10-25 18:40             ` Glenn Morris
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-25 18:31 UTC (permalink / raw)
  To: Glenn Morris; +Cc: 15260

> From: Glenn Morris <rgm@gnu.org>
> Cc: 15260@debbugs.gnu.org
> Date: Fri, 25 Oct 2013 13:08:08 -0400
> 
> Eli Zaretskii wrote:
> 
> > It requires too much setup on my part (this cannot be simulated on
> > Windows without too much hassle).
> 
> Sorry, I assumed you could build on fencepost.

That's the backup plan, yes.

> I'm expecting multiple
> points of failure, so this might not be an efficient process...

I don't think we should do it this way.  I was just asking about the
current state of knowledge.

I presume that the changes I suggested didn't help?  (They are TRT
anyway, so I will install them regardless.)

> The first time, I was trying an out-of-tree build in a non-ascii build
> directory, but still with ascii srcdir. Using an in-place build in a
> non-ascii directory fails to even start temacs (this is with or without
> your patch):
> 
>     Warning: arch-independent data dir
>     `/tmp/EMACS/share/emacs/24.3.50/etc/': No such file or directory
>     Error: charsets directory not found:
>     /tmp/EMACS/share/emacs/24.3.50/etc/charsets
>     Emacs will not function correctly without the character map files.
>     Please check your installation!
>     make[1]: *** [bootstrap-emacs] Error 1
> 
> /tmp/EMACS was my install --prefix. It's not supposed to exist until
> after installation, but the code that tries to find etc/ is presumably
> mistakenly concluding that srcdir/etc does not exist and that it must be
> running installed.

So in the above, /tmp/EMACS/share/emacs/24.3.50/etc/ is pure-ASCII,
and the non-ASCII directory is in the source tree, is that right?






* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-25 18:31           ` Eli Zaretskii
@ 2013-10-25 18:40             ` Glenn Morris
  2013-10-25 18:46               ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Glenn Morris @ 2013-10-25 18:40 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 15260

Eli Zaretskii wrote:

> I presume that the changes I suggested didn't help?  (They are TRT
> anyway, so I will install them regardless.)

It helps for "ascii srcdir, non-ascii builddir", but there are still
problems later on, again related to Emacs mistakenly believing that
certain directories do not exist, when they do (Warning: arch-dependent
data dir `...' No such file or directory; etc).

The "non-ascii srcdir == builddir" case fails even earlier, due to not
finding etc.

> So in the above, /tmp/EMACS/share/emacs/24.3.50/etc/ is pure-ASCII,
> and the non-ASCII directory is in the source tree, is that right?

Yes. I literally did (in a non-ascii directory):

./configure --prefix=/tmp/EMACS


* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-25 18:40             ` Glenn Morris
@ 2013-10-25 18:46               ` Eli Zaretskii
  2013-10-25 19:27                 ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-25 18:46 UTC (permalink / raw)
  To: Glenn Morris; +Cc: 15260

> From: Glenn Morris <rgm@gnu.org>
> Cc: 15260@debbugs.gnu.org
> Date: Fri, 25 Oct 2013 14:40:46 -0400
> 
> Eli Zaretskii wrote:
> 
> > I presume that the changes I suggested didn't help?  (They are TRT
> > anyway, so I will install them regardless.)
> 
> It helps for "ascii srcdir, non-ascii builddir"

Good, so one down, N - 1 to go ;-)

> but there are still
> problems later on, again related to Emacs mistakenly believing that
> certain directories do not exist, when they do (Warning: arch-dependent
> data dir `...' No such file or directory; etc).
> 
> The "non-ascii srcdir == builddir" case fails even earlier, due to not
> finding etc.

OK, I will take a closer look.  Thanks for the info.






* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-25 18:46               ` Eli Zaretskii
@ 2013-10-25 19:27                 ` Eli Zaretskii
  2013-10-26  7:50                   ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-25 19:27 UTC (permalink / raw)
  To: rgm; +Cc: 15260

> Date: Fri, 25 Oct 2013 21:46:52 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 15260@debbugs.gnu.org
> 
> > but there are still
> > problems later on, again related to Emacs mistakenly believing that
> > certain directories do not exist, when they do (Warning: arch-dependent
> > data dir `...' No such file or directory; etc).
> > 
> > The "non-ascii srcdir == builddir" case fails even earlier, due to not
> > finding etc.
> 
> OK, I will take a closer look.  Thanks for the info.

I think I see the problem.  All those PATH_* variables that come from
epaths.h yield encoded file names (because they were written by the
shell).  But we never decode them before using them in init_callproc
and init_callproc_1.  Similar things happen with decode_env_path: it
calls 'getenv', but never decodes the values it gets from that.

I will take a crack at fixing these.
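
A minimal sketch of the missing step, in Python and purely illustrative:
values obtained from the environment or from compiled-in paths arrive as
bytes and have to be decoded before they are stored as text.  The
function name decode_path_list and the default of UTF-8 are assumptions
made for the example; this is not a description of what decode_env_path
does:

```python
from typing import List


def decode_path_list(raw: bytes, encoding: str = "utf-8") -> List[str]:
    # Split a colon-separated, PATH-like byte string and decode each
    # element so the rest of the program only ever sees text.
    return [element.decode(encoding) for element in raw.split(b":")]


raw_value = "/usr/local/caf\u00e9/bin:/usr/bin".encode("utf-8")
print(decode_path_list(raw_value))
```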






* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-25 19:27                 ` Eli Zaretskii
@ 2013-10-26  7:50                   ` Eli Zaretskii
  2013-10-26 19:15                     ` Glenn Morris
  2013-10-27  4:28                     ` Stefan Monnier
  0 siblings, 2 replies; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-26  7:50 UTC (permalink / raw)
  To: rgm, Stefan Monnier, Kenichi Handa; +Cc: 15260

> Date: Fri, 25 Oct 2013 22:27:19 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 15260@debbugs.gnu.org
> 
> > Date: Fri, 25 Oct 2013 21:46:52 +0300
> > From: Eli Zaretskii <eliz@gnu.org>
> > Cc: 15260@debbugs.gnu.org
> > 
> > > but there are still
> > > problems later on, again related to Emacs mistakenly believing that
> > > certain directories do not exist, when they do (Warning: arch-dependent
> > > data dir `...' No such file or directory; etc).
> > > 
> > > The "non-ascii srcdir == builddir" case fails even earlier, due to not
> > > finding etc.
> > 
> > OK, I will take a closer look.  Thanks for the info.
> 
> I think I see the problem.  All those PATH_* variables that come from
> epaths.h yield encoded file names (because they were written by the
> shell).  But we never decode them before using them in init_callproc
> and init_callproc_1.  Similar things happen with decode_env_path: it
> calls 'getenv', but never decodes the values it gets from that.
> 
> I will take a crack at fixing these.

We definitely need to decode file names in init_callproc_1,
init_callproc, and init_lread.

But here's where things get hairy: when temacs starts, preloaded Lisp
files are not yet loaded, and consequently file-name-coding-system and
default-file-name-coding-system are both nil.  In such a case,
currently DECODE_FILE is a no-op.

So we need some way of getting temacs to know what coding-system to
use to decode file names during its initialization phase, without
relying on the database we have in locale-language-names.  This
probably calls for a separate variable, init-file-name-coding-system,
say.  But how to assign a correct value to it?

I understand that most Posix systems nowadays use UTF-8 for file
names, so I guess we can fall back on that.  On MS-Windows, there's a
system call that returns the necessary information, so there's no
problem for MS-Windows.  The question is what to do for Posix systems
that don't use UTF-8?  I see 2 possibilities:

 . Try to parse the value of LANG with some shell or Sed script, and
   come up with a suitable value.

 . Ask the user to specify the encoding as a switch to the configure
   script.

In both cases, communicate the value to temacs via --eval on its
command line.
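
The first option could be sketched roughly as follows.  This is Python
only to show the parsing; the real thing would presumably be shell or Sed
run at configure time.  The function name coding_system_from_lang is
hypothetical, the UTF-8 fallback follows the suggestion above, and the
locale name is assumed to have the usual form
language[_territory][.codeset][@modifier]:

```python
def coding_system_from_lang(lang: str, default: str = "utf-8") -> str:
    # Drop an optional "@modifier" suffix, then look for ".codeset".
    without_modifier = lang.split("@", 1)[0]
    if "." in without_modifier:
        codeset = without_modifier.split(".", 1)[1]
        if codeset:
            return codeset.lower()
    # No explicit codeset in the locale name: fall back to the default.
    return default


for value in ("en_US.UTF-8", "fr_FR.ISO-8859-1", "ja_JP.eucJP", "C"):
    print(value, "->", coding_system_from_lang(value))
```

A locale such as de_DE@euro carries no explicit codeset, so a parser like
this can only fall back to the default, which is the difficult case for
systems that do not use UTF-8.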

Comments and opinions are welcome.


* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-26  7:50                   ` Eli Zaretskii
@ 2013-10-26 19:15                     ` Glenn Morris
  2013-10-26 20:04                       ` Eli Zaretskii
  2013-10-27  4:28                     ` Stefan Monnier
  1 sibling, 1 reply; 50+ messages in thread
From: Glenn Morris @ 2013-10-26 19:15 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 15260

Eli Zaretskii wrote:

> Comments and opinions are welcome.

Sounds like a fair bit of work, for something that doesn't seem very
important. If my testing was correct, the problem only occurs during
building, not after Emacs is installed (does that tally with what you
found?). And I can't see any reason why anyone _needs_ to build Emacs in
a directory with non-ASCII chars. 


* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-26 19:15                     ` Glenn Morris
@ 2013-10-26 20:04                       ` Eli Zaretskii
  2013-10-27  3:56                         ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-26 20:04 UTC (permalink / raw)
  To: Glenn Morris; +Cc: 15260

> From: Glenn Morris <rgm@gnu.org>
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,  Kenichi Handa <handa@gnu.org>,  15260@debbugs.gnu.org
> Date: Sat, 26 Oct 2013 15:15:06 -0400
> 
> Sounds like a fair bit of work, for something that doesn't seem very
> important.

It might be important for people who build Emacs on non-English
language systems.

> If my testing was correct, the problem only occurs during
> building, not after Emacs is installed (does that tally with what you
> found?).

It definitely happens when building.  I didn't look deep enough to see
what happens once Emacs is installed.  The code is definitely wrong.

> And I can't see any reason why anyone _needs_ to build Emacs in
> a directory with non-ASCII chars. 

It might be a natural thing in some quarters.  E.g., Emacs sources
might be a subdirectory of some parent directory with a non-ASCII name
where many other packages are built.

Anyway, if the project thinks it's not important enough, I have better
things to do.


* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-26 20:04                       ` Eli Zaretskii
@ 2013-10-27  3:56                         ` Eli Zaretskii
  2013-10-27 16:19                           ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-27  3:56 UTC (permalink / raw)
  To: rgm; +Cc: 15260

> Date: Sat, 26 Oct 2013 23:04:49 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 15260@debbugs.gnu.org
> 
> > If my testing was correct, the problem only occurs during
> > building, not after Emacs is installed (does that tally with what you
> > found?).
> 
> It definitely happens when building.  I didn't look deep enough to see
> what happens once Emacs is installed.  The code is definitely wrong.

Btw, are you sure the installed Emacs doesn't find the files under the
source tree?  Did you try to remove or rename it after installing?






* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-27  3:56                         ` Eli Zaretskii
@ 2013-10-27 16:19                           ` Eli Zaretskii
  2013-10-27 19:02                             ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-27 16:19 UTC (permalink / raw)
  To: rgm; +Cc: 15260

> Date: Sun, 27 Oct 2013 05:56:44 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 15260@debbugs.gnu.org
> 
> > Date: Sat, 26 Oct 2013 23:04:49 +0300
> > From: Eli Zaretskii <eliz@gnu.org>
> > Cc: 15260@debbugs.gnu.org
> > 
> > > If my testing was correct, the problem only occurs during
> > > building, not after Emacs is installed (does that tally with what you
> > > found?).
> > 
> > It definitely happens when building.  I didn't look deep enough to see
> > what happens once Emacs is installed.  The code is definitely wrong.
> 
> Btw, are you sure the installed Emacs doesn't find the files under the
> source tree?  Did you try to remove or rename it after installing?

Further testing indicates that it indeed works to install in a
non-ASCII directory after building.  But it only barely works, at
least in my testing: the various files and directories in
doc-directory, load-path, etc. are unibyte strings, so using them only
works if they are passed to file primitives.  If you try to invoke a
program with one of these values as a command-line argument, the
program will fail (unless your locale encoding is identical to
file-name encoding).  And even using the unibyte strings in
conjunction with files is fragile, as, for example, 'equal' will not
compare unibyte and multibyte strings of the same bytes as equal.
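
As a loose analogy only (Python's bytes and str are not the same thing as
Emacs unibyte and multibyte strings), the comparison problem looks like
this: an undecoded name and its decoded form hold the same file name but
do not compare equal until both are brought into the same
representation.  The path used is made up for the example:

```python
raw = "/opt/caf\u00e9/share/emacs".encode("utf-8")  # undecoded bytes
text = raw.decode("utf-8")                          # decoded text

# Bytes and text do not compare equal, even for the same name.
print(raw == text)
# They match once both sides are in the same form.
print(raw == text.encode("utf-8"))
```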


* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-27 16:19                           ` Eli Zaretskii
@ 2013-10-27 19:02                             ` Eli Zaretskii
  2013-10-27 19:43                               ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-27 19:02 UTC (permalink / raw)
  To: rgm, Stefan Monnier, Kenichi Handa; +Cc: 15260

The first few problems that pop up when building from the source tree
whose parent has a non-ASCII name are solved by the changes below.

I'm not very fond of these changes, especially the last one: it all
looks very fragile and ad-hoc, and that's still on a system with a
UTF-8 locale, where things should be relatively easy.

After applying these changes, temacs comes up and dumps itself, but
fails to find simple.el and bytecomp.el when it proceeds to compiling
Lisp files.  I guess now load-path is the culprit.

Stay tuned.


=== modified file 'lisp/loadup.el'
--- lisp/loadup.el      2013-10-08 15:11:29 +0000
+++ lisp/loadup.el      2013-10-27 18:26:12 +0000
@@ -150,7 +150,9 @@
 (load "epa-hook")
 ;; Any Emacs Lisp source file (*.el) loaded here after can contain
 ;; multilingual text.
-(load "international/mule-cmds")
+(let ((dfn-coding default-file-name-coding-system))
+  (load "international/mule-cmds")
+  (setq default-file-name-coding-system dfn-coding))
 (load "case-table")
 ;; This file doesn't exist when building a development version of Emacs
 ;; from the repository.  It is generated just after temacs is built.
@@ -163,7 +165,9 @@
 (load "language/cyrillic")
 (load "language/indian")
 (load "language/sinhala")
-(load "language/english")
+(let ((dfn-coding default-file-name-coding-system))
+  (load "language/english")
+  (setq default-file-name-coding-system dfn-coding))
 (load "language/ethiopic")
 (load "language/european")
 (load "language/czech")

=== modified file 'src/emacs.c'
--- src/emacs.c 2013-10-26 13:43:58 +0000
+++ src/emacs.c 2013-10-27 18:48:51 +0000
@@ -2044,14 +2044,22 @@ You must run Emacs in batch mode in orde

   CHECK_STRING (filename);
   filename = Fexpand_file_name (filename, Qnil);
-  filename = ENCODE_FILE (filename);
+  if (NILP (Vfile_name_coding_system)
+      && NILP (Vdefault_file_name_coding_system))
+    filename = Fstring_to_unibyte (filename);
+  else
+    filename = ENCODE_FILE (filename);
   if (!NILP (symfile))
     {
       CHECK_STRING (symfile);
       if (SCHARS (symfile))
        {
          symfile = Fexpand_file_name (symfile, Qnil);
-         symfile = ENCODE_FILE (symfile);
+         if (NILP (Vfile_name_coding_system)
+             && NILP (Vdefault_file_name_coding_system))
+           symfile = Fstring_to_unibyte (symfile);
+         else
+           symfile = ENCODE_FILE (symfile);
        }
     }
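
The loadup.el part of the patch above follows a plain save-and-restore
pattern around each load.  A sketch of that pattern in Python, with
made-up names (load_preserving, load_language_file) and a dictionary
standing in for the Lisp variable:

```python
def load_preserving(settings: dict, key: str, loader) -> None:
    # Remember the current value, run the loader (which may change it as
    # a side effect), then put the remembered value back.
    saved = settings[key]
    loader(settings)
    settings[key] = saved


def load_language_file(settings: dict) -> None:
    # Stand-in for a file whose top-level forms reset the default.
    settings["default-file-name-coding-system"] = "latin-1"


settings = {"default-file-name-coding-system": None}
load_preserving(settings, "default-file-name-coding-system", load_language_file)
print(settings["default-file-name-coding-system"])
```

Like the patch, this restores the value only when the load returns
normally; nothing is restored if the loader raises an error.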







* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-27 19:02                             ` Eli Zaretskii
@ 2013-10-27 19:43                               ` Eli Zaretskii
  0 siblings, 0 replies; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-27 19:43 UTC (permalink / raw)
  To: rgm, monnier, handa; +Cc: 15260

> Date: Sun, 27 Oct 2013 21:02:51 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 15260@debbugs.gnu.org
> 
> After applying these changes, temacs comes up and dumps itself, but
> fails to find simple.el and bytecomp.el when it proceeds to compiling
> Lisp files.                             ^^^^^^^^^^^^^^^^

Instead of "it" I should have written "bootstrap-emacs".  Sorry.






* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-26  7:50                   ` Eli Zaretskii
  2013-10-26 19:15                     ` Glenn Morris
@ 2013-10-27  4:28                     ` Stefan Monnier
  2013-10-27 16:11                       ` Eli Zaretskii
  1 sibling, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2013-10-27  4:28 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 15260

> But here's where things get hairy: when temacs starts, preloaded Lisp
> files are not yet loaded, and consequently file-name-coding-system and
> default-file-name-coding-system are both nil.  In such a case,
> currently DECODE_FILE is a no-op.

I don't understand why it wouldn't work to just treat those strings as
"binary" (i.e. keep them undecoded in unibyte strings).  Then encoding
would be a noop and that should hence end up in the right byte-sequence
sent to the OS primitives.


        Stefan






* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-27  4:28                     ` Stefan Monnier
@ 2013-10-27 16:11                       ` Eli Zaretskii
  2013-10-28  0:30                         ` Stefan Monnier
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-27 16:11 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 15260

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: rgm@gnu.org,  Kenichi Handa <handa@gnu.org>,  15260@debbugs.gnu.org
> Date: Sun, 27 Oct 2013 00:28:36 -0400
> 
> I don't understand why it wouldn't work to just treat those strings as
> "binary" (i.e. keep them undecoded in unibyte strings).  Then encoding
> would be a noop and that should hence end up in the right byte-sequence
> sent to the OS primitives.

Not sure I'm following you here.  I presume you aren't asking why we
generally hold file names in decoded form inside Emacs, nor suggesting
that we switch to storing them as undecoded unibyte strings.

So I guess you are asking why the particular piece of code being
discussed here couldn't keep file names as unibyte strings, is that
your question?

If so, then the answer is "it could, but that would be even more
hair."

The problem is that the code involved in this (specifically,
init_callproc_1, init_callproc, and probably also init_cmdargs and
init_lread) is not something specifically written to stat the
directories from epaths.h and announce their non-existence.  That code
populates important variables with names of files and directories and
lists of directories that are henceforth used in Emacs all over the
place.  Notable examples are data-directory, doc-directory, exec-path,
and load-path.  Without populating these variables, temacs will not
work, and the code which uses these variables assumes their values are
decoded strings.

The error messages are a by-product: as Emacs computes the values of
these variables, it checks the files and directories for existence,
and complains if they don't.  The root cause is that unibyte strings
get stored in variables used by Emacs on the assumption that they are
decoded.

Given the above, I'm not sure exactly what you are suggesting in
practical terms.  Can you elaborate?


* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-27 16:11                       ` Eli Zaretskii
@ 2013-10-28  0:30                         ` Stefan Monnier
  2013-10-28  3:39                           ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2013-10-28  0:30 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 15260

> So I guess you are asking why the particular piece of code being
> discussed here couldn't keep file names as unibyte strings, is that
> your question?

IIUC the issue is how to encode when we don't yet have the
coding-systems loaded/setup.  But it seems if we can't encode, then we
can't decode either, so we should just fall back on using unibyte strings
(which shouldn't be encoded on the way back to the OS) for those file
names we create/manipulate before coding-systems are available.

        Stefan


* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-28  0:30                         ` Stefan Monnier
@ 2013-10-28  3:39                           ` Eli Zaretskii
  2013-10-28  4:05                             ` Stefan Monnier
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-28  3:39 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 15260

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: rgm@gnu.org,  handa@gnu.org,  15260@debbugs.gnu.org
> Date: Sun, 27 Oct 2013 20:30:32 -0400
> 
> > So I guess you are asking why the particular piece of code being
> > discussed here couldn't keep file names as unibyte strings, is that
> > your question?
> 
> IIUC the issue is how to encode when we don't yet have the
> coding-systems loaded/setup.  But it seems if we can't encode, then we
> can't decode either, so we should just fall back on using unibyte strings
> (which shouldn't be encoded on the way back to the OS) for those file
> names we create/manipulate before coding-systems are available.

As I explained, this would be even more hair than what I proposed,
because you are talking about core Emacs data structures and variables
that are involved in every file-related op.

On top of that, using unibyte strings is inherently fragile in Emacs,
as the code is not written to support them too well, as you well
know.  We always advise users to stay away from unibyte strings, and for
a good reason, so doing this ourselves sounds unwise.






* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-28  3:39                           ` Eli Zaretskii
@ 2013-10-28  4:05                             ` Stefan Monnier
  2013-10-28 16:47                               ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2013-10-28  4:05 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 15260

> As I explained, this would be even more hair than what I proposed,
> because you are talking about core Emacs data structures and variables
> that are involved in every file-related op.

> On top of that, using unibyte strings is inherently fragile in Emacs,
> as the code is not written to support them too well, as you well
> know.  We always advise users to stay away from unibyte strings, and for
> a good reason, so doing this ourselves sounds unwise.

I know, but I'm not sure why it doesn't "just work".

More specifically, for the bug to appear, you need ENCODE (DECODE (s))
to not be the identity function.  Why is it not so in the "early" Emacs?
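
To make the condition concrete, here is a small Python sketch (an
illustration, not Emacs code; round_trip is a hypothetical helper).  The
round trip is the identity as long as decoding and encoding use the same
coding system, and stops being one as soon as they differ:

```python
def round_trip(raw: bytes, decode_with: str, encode_with: str) -> bytes:
    # Decode the raw file-name bytes to text, then encode them again.
    return raw.decode(decode_with).encode(encode_with)


raw = "/tmp/caf\u00e9".encode("utf-8")

# Same coding system both ways: the original bytes come back.
print(round_trip(raw, "utf-8", "utf-8") == raw)
# Decoded as UTF-8 but encoded as Latin-1: the bytes change.
print(round_trip(raw, "utf-8", "latin-1") == raw)
```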


        Stefan






* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-28  4:05                             ` Stefan Monnier
@ 2013-10-28 16:47                               ` Eli Zaretskii
  2013-10-28 18:33                                 ` Eli Zaretskii
  2013-10-31 21:45                                 ` Glenn Morris
  0 siblings, 2 replies; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-28 16:47 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 15260

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: rgm@gnu.org,  handa@gnu.org,  15260@debbugs.gnu.org
> Date: Mon, 28 Oct 2013 00:05:32 -0400
> 
> More specifically, for the bug to appear, you need ENCODE (DECODE (s))
> to not be the identity function.  Why is it not so in the "early" Emacs?

Because life's a mess that doesn't easily fit into simple and elegant
schemes ;-)

For starters, we don't really DECODE_FILE with these file- and
directory-names.  We just use build_string or make_string, as you can
easily see in the init_* functions I mentioned.  If you are lucky and
your file names are UTF-8 encoded, this produces the same result as
DECODE_FILE.  If you are less lucky, and your file names are encoded
in something else, like Latin-N, you get a unibyte string with the
same bytes as in the original.  Then we pass these strings to various
functions, like file_accessible_directory_p, that _do_ ENCODE_FILE...
(Luckily, during most of temacs's run, both file-name-coding-system
and its default value are nil, so ENCODE_FILE is a no-op -- except
when they aren't, see the next paragraph.)

Next, it is quite possible that the file-name-coding-system changes
between the time we process and store the file name and the time we
encode and pass it to a low-level function.  This is especially true
during "loadup", when many packages are loaded and their top-level
forms are executed.  It turns out that 2 of them have side effects
that do just that: mule-cmds.el calls reset-language-environment, and
language/english.el calls set-language-info-alist; both have the
effect of resetting default-file-name-coding-system to latin-1 (!? an
interesting "default" for a Unicode-era Emacs, perhaps Handa-san could
comment why we still do that).  When this happens, your symmetry is
broken, and ENCODE_FILE (DECODE_FILE (f)) is no longer the identity
function.
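To make the broken symmetry concrete, here is a minimal sketch in plain C.
It is not Emacs code: decode_utf8, encode_latin1 and roundtrip_is_identity
are hypothetical helpers, and the decoder only handles 1- and 2-byte UTF-8
sequences.  It models a name whose bytes were interpreted as UTF-8 when
stored and encoded as Latin-1 when handed to the file system.

```c
#include <stddef.h>
#include <string.h>

/* Decode UTF-8 (1- and 2-byte sequences only, i.e. U+0000..U+07FF)
   into code points.  Returns the number of code points, or -1 on
   input this sketch does not handle.  */
static int
decode_utf8 (const unsigned char *src, size_t len, unsigned *out)
{
  int n = 0;
  for (size_t i = 0; i < len; )
    {
      if (src[i] < 0x80)
        out[n++] = src[i++];
      else if ((src[i] & 0xE0) == 0xC0 && i + 1 < len
               && (src[i + 1] & 0xC0) == 0x80)
        {
          out[n++] = ((src[i] & 0x1Fu) << 6) | (src[i + 1] & 0x3Fu);
          i += 2;
        }
      else
        return -1;
    }
  return n;
}

/* Encode code points as Latin-1, one byte per character.  Returns the
   byte count, or -1 if a character is outside U+0000..U+00FF.  */
static int
encode_latin1 (const unsigned *cp, int n, unsigned char *out)
{
  for (int i = 0; i < n; i++)
    {
      if (cp[i] > 0xFF)
        return -1;
      out[i] = (unsigned char) cp[i];
    }
  return n;
}

/* Return 1 if decoding SRC as UTF-8 and re-encoding the result as
   Latin-1 gives back the same bytes, 0 if it does not, and -1 on
   input the sketch cannot handle (too long, or not decodable).  */
int
roundtrip_is_identity (const unsigned char *src, size_t len)
{
  unsigned cp[256];
  unsigned char back[256];
  if (len > 256)
    return -1;
  int n = decode_utf8 (src, len, cp);
  if (n < 0)
    return -1;
  int m = encode_latin1 (cp, n, back);
  if (m < 0)
    return -1;
  return (size_t) m == len && memcmp (src, back, len) == 0;
}
```

For a pure-ASCII name such as "lisp" the round trip is the identity.  For
a name containing the UTF-8 bytes C3 A1 it is not: the re-encoded name has
the single byte E1 there, so the file system is asked for a name that does
not exist.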

And then there are other players in this game.  For example,
default-directory, which is used every time we call expand-file-name,
IOW "a lot".  If you look in init_buffer, you will see that the
default-directory of *scratch* is first set to a multibyte
representation of the unibyte string we get from getcwd.  In a
"normal" Emacs session, we promptly fix that in startup.el, after the
call to set-locale-environment initializes all the coding-systems.
But "temacs -l loadup dump" doesn't run startup.el, so we are left
with what init_buffer did, which is a string no file-name API will be
able to grok.

Another example is the use of 'equal' (and 'member', which calls
'equal') to compare file and directory names, and look them up in
lists: as you know, 'equal' will not compare a unibyte and a multibyte
string as equal.  So having a mix of unibyte and multibyte strings in
file names fails some of the code that relies on 'equal', tricking it
into doing wrong things, like deciding that Emacs is _not_ run from
the source tree.
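The lookup failure can be pictured with a toy model.  The sketch below is
an assumption-laden illustration, not the Emacs implementation: struct
toy_string, toy_equal and toy_member are hypothetical names, and the model
simply treats two strings as equal only when their character count, byte
count and bytes all match.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a Lisp string: a character count, a byte count,
   and the bytes themselves.  */
struct toy_string
{
  size_t nchars;
  size_t nbytes;
  const unsigned char *data;
};

/* Two toy strings are equal only if both counts and all bytes match.  */
int
toy_equal (const struct toy_string *a, const struct toy_string *b)
{
  return a->nchars == b->nchars
         && a->nbytes == b->nbytes
         && memcmp (a->data, b->data, a->nbytes) == 0;
}

/* Return the index of KEY in LIST (N elements), or -1 if absent.  */
int
toy_member (const struct toy_string *key,
            const struct toy_string *list, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (toy_equal (key, &list[i]))
      return (int) i;
  return -1;
}
```

In this model a one-character directory name held as the single byte E1
and the same name held as the two bytes C3 A1 never compare equal, so a
membership test against a list that stores the other form comes back
empty even though both strings name the same directory.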

I'm sure there's more to this saga, I'm just half-way through it...

* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-28 16:47                               ` Eli Zaretskii
@ 2013-10-28 18:33                                 ` Eli Zaretskii
  2013-10-28 22:00                                   ` Glenn Morris
  2013-10-29  1:35                                   ` Stefan Monnier
  2013-10-31 21:45                                 ` Glenn Morris
  1 sibling, 2 replies; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-28 18:33 UTC (permalink / raw)
  To: monnier, Kenichi Handa; +Cc: 15260

> Date: Mon, 28 Oct 2013 18:47:32 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 15260@debbugs.gnu.org
> 
> I'm sure there's more to this saga, I'm just half-way through it...

The next round is here:

  # The actual Emacs command run in the targets below.

  emacs = EMACSLOADPATH="$(abs_lisp)" LC_ALL=C "$(EMACS)" $(EMACSOPT)
                                      ^^^^^^^^

Does anyone know or remember why we set LC_ALL=C while running
commands in lisp/ (and the same in leim/)?  The following log entry
in lisp/ChangeLog.13 is the only clue:

  2008-02-01  Kenichi Handa  <handa@etl.go.jp>

	  * Makefile.in: Be sure to run emacs with LC_ALL=C.

But there's no explanation as to why this is needed.

What this does is prevent bootstrap-emacs from finding Lisp files,
because LC_ALL=C implies -- you guessed it -- file-name encoding by
Latin-1, whereas the file names are really encoded in UTF-8 on this
system:

  cd ../lisp; make -w compile-first EMACS="/home/e/eliz/bzr/emacs/xáéçö/src/bootstrap-emacs"
  make[2]: Entering directory `/srv/data/home/e/eliz/bzr/emacs/xáéçö/lisp'
  Compiling emacs-lisp/macroexp.el
  Warning: Could not find simple.el or simple.elc
  The EMACSLOADPATH environment variable is set, please check its value
  Lisp directory /home/e/eliz/bzr/emacs/x<E1><E9><E7><F6>/lisp not readable?

If I remove the LC_ALL=C setting from lisp/Makefile.in and
leim/Makefile.in, I get past this problem (to the next one ;-).

So: any reasons not to remove this setting from lisp/Makefile.in and
leim/Makefile.in?

* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-28 18:33                                 ` Eli Zaretskii
@ 2013-10-28 22:00                                   ` Glenn Morris
  2013-10-29  3:42                                     ` Eli Zaretskii
  2013-10-29  1:35                                   ` Stefan Monnier
  1 sibling, 1 reply; 50+ messages in thread
From: Glenn Morris @ 2013-10-28 22:00 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 15260

Eli Zaretskii wrote:

>   emacs = EMACSLOADPATH="$(abs_lisp)" LC_ALL=C "$(EMACS)" $(EMACSOPT)
>                                       ^^^^^^^^
>
> Does anyone know or remember why we set LC_ALL=C while running
> commands in lisp/ (and the same in leim/)? 

FWIW, if I change that to use LC_ALL=en_US.UTF8 and bootstrap (after
also fixing cl--gensym-counter to a non-random default), all the
resulting *.elc files are identical to the LC_ALL=C case.





* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-28 22:00                                   ` Glenn Morris
@ 2013-10-29  3:42                                     ` Eli Zaretskii
  0 siblings, 0 replies; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-29  3:42 UTC (permalink / raw)
  To: Glenn Morris; +Cc: 15260

> From: Glenn Morris <rgm@gnu.org>
> Cc: monnier@iro.umontreal.ca,  Kenichi Handa <handa@gnu.org>,  15260@debbugs.gnu.org
> Date: Mon, 28 Oct 2013 18:00:25 -0400
> 
> Eli Zaretskii wrote:
> 
> >   emacs = EMACSLOADPATH="$(abs_lisp)" LC_ALL=C "$(EMACS)" $(EMACSOPT)
> >                                       ^^^^^^^^
> >
> > Does anyone know or remember why we set LC_ALL=C while running
> > commands in lisp/ (and the same in leim/)? 
> 
> FWIW, if I change that to use LC_ALL=en_US.UTF8 and bootstrap (after
> also fixing cl--gensym-counter to a non-random default), all the
> resulting *.elc files are identical to the LC_ALL=C case.

Thanks.





* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-28 18:33                                 ` Eli Zaretskii
  2013-10-28 22:00                                   ` Glenn Morris
@ 2013-10-29  1:35                                   ` Stefan Monnier
  2013-10-29  3:47                                     ` Eli Zaretskii
  2013-11-01 13:58                                     ` Kenichi Handa
  1 sibling, 2 replies; 50+ messages in thread
From: Stefan Monnier @ 2013-10-29  1:35 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 15260

>   emacs = EMACSLOADPATH="$(abs_lisp)" LC_ALL=C "$(EMACS)" $(EMACSOPT)
>                                       ^^^^^^^^

> Does anyone know or remember why we set LC_ALL=C while running
> commands in lisp/ (and the same in leim/)?

IIRC the issue was to avoid things like misdetecting coding-systems
because of the user's locale setting, in the files we load/compile.

IOW, it was to work around bugs (e.g. missing coding: cookie) and is
likely unneeded nowadays.


        Stefan





* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-29  1:35                                   ` Stefan Monnier
@ 2013-10-29  3:47                                     ` Eli Zaretskii
  2013-10-29 13:56                                       ` Stefan Monnier
  2013-11-01 13:58                                     ` Kenichi Handa
  1 sibling, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-29  3:47 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 15260

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Kenichi Handa <handa@gnu.org>,  15260@debbugs.gnu.org
> Date: Mon, 28 Oct 2013 21:35:00 -0400
> 
> >   emacs = EMACSLOADPATH="$(abs_lisp)" LC_ALL=C "$(EMACS)" $(EMACSOPT)
> >                                       ^^^^^^^^
> 
> > Does anyone know or remember why we set LC_ALL=C while running
> > commands in lisp/ (and the same in leim/)?
> 
> IIRC the issue was to avoid things like misdetecting coding-systems
> because of the user's locale setting, in the files we load/compile.

Makes sense.

> IOW, it was to work around bugs (e.g. missing coding: cookie) and is
> likely unneeded nowadays.

Right, so the way should be clear to remove these.

Thanks.





* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-29  3:47                                     ` Eli Zaretskii
@ 2013-10-29 13:56                                       ` Stefan Monnier
  2013-10-30 18:19                                         ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2013-10-29 13:56 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 15260

>> IOW, it was to work around bugs (e.g. missing coding: cookie) and is
>> likely unneeded nowadays.
> Right, so the way should be clear to remove these.

There may still be problems lying dormant, but we should be able to fix
them if/when they show up.


        Stefan





* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-29 13:56                                       ` Stefan Monnier
@ 2013-10-30 18:19                                         ` Eli Zaretskii
  2013-10-31  1:01                                           ` Stefan Monnier
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-30 18:19 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: 15260

I bumped into this as part of digging into this bug report: there's a
strange inconsistency between make_string and string_to_multibyte (or
maybe it's just that the "multibyte" part of the name is overloaded).

Specifically, if you pass a unibyte string through string_to_multibyte,
it will produce a multibyte string, as expected.  But if SDATA of the
resulting multibyte string, or any of its derivatives, is passed to
make_string, the latter will decide that it must make a unibyte string!
This is because parse_str_as_multibyte, called internally by
make_string, considers the multibyte representation of 8-bit bytes as
a sign that the string produced from these bytes must be unibyte.

Why do we have this confusing inconsistency?

* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-30 18:19                                         ` Eli Zaretskii
@ 2013-10-31  1:01                                           ` Stefan Monnier
  2013-10-31  3:47                                             ` Eli Zaretskii
  2013-10-31 17:16                                             ` Eli Zaretskii
  0 siblings, 2 replies; 50+ messages in thread
From: Stefan Monnier @ 2013-10-31  1:01 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 15260

> Why do we have this confusing inconsistency?

make_string is a bug.  There's no way to know/guess if the string should
be unibyte or multibyte.  So, it should be removed and replaced by calls
to either make_unibyte_string or make_multibyte_string.
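The ambiguity can be shown with a small sketch.  This is not make_string
or parse_str_as_multibyte: guess_string_kind is a hypothetical function
that uses ordinary UTF-8 well-formedness (lead and continuation bytes
only, no overlong or surrogate checks) as a stand-in for the real test.

```c
#include <stddef.h>

enum string_kind { KIND_UNIBYTE, KIND_MULTIBYTE };

/* Guess how the LEN bytes at S should be stored.  The guess is
   MULTIBYTE only when there is at least one non-ASCII byte and every
   non-ASCII byte belongs to a well-formed multibyte sequence;
   anything else is treated as UNIBYTE.  */
enum string_kind
guess_string_kind (const unsigned char *s, size_t len)
{
  int saw_multibyte = 0;
  size_t i = 0;

  while (i < len)
    {
      size_t need;              /* continuation bytes expected */

      if (s[i] < 0x80)
        {
          i++;
          continue;
        }
      else if ((s[i] & 0xE0) == 0xC0)
        need = 1;
      else if ((s[i] & 0xF0) == 0xE0)
        need = 2;
      else if ((s[i] & 0xF8) == 0xF0)
        need = 3;
      else
        return KIND_UNIBYTE;    /* stray continuation or bad lead byte */

      if (len - i - 1 < need)
        return KIND_UNIBYTE;    /* sequence truncated */
      for (size_t k = 1; k <= need; k++)
        if ((s[i + k] & 0xC0) != 0x80)
          return KIND_UNIBYTE;  /* not a continuation byte */

      saw_multibyte = 1;
      i += need + 1;
    }
  return saw_multibyte ? KIND_MULTIBYTE : KIND_UNIBYTE;
}
```

The two bytes C3 A1 are classified as multibyte, which is right if they
came from a UTF-8 file name but wrong if they were two Latin-1 characters.
The bytes alone cannot say which, so the caller has to state the intent by
choosing the constructor.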


        Stefan





* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-31  1:01                                           ` Stefan Monnier
@ 2013-10-31  3:47                                             ` Eli Zaretskii
  2013-10-31 13:40                                               ` Stefan Monnier
  2013-10-31 17:59                                               ` Eli Zaretskii
  2013-10-31 17:16                                             ` Eli Zaretskii
  1 sibling, 2 replies; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-31  3:47 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 15260

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Kenichi Handa <handa@gnu.org>,  15260@debbugs.gnu.org
> Date: Wed, 30 Oct 2013 21:01:21 -0400
> 
> > Why do we have this confusing inconsistency?
> 
> make_string is a bug.  There's no way to know/guess if the string should
> be unibyte or multibyte.

Well, there is a way, but it's tricky ;-)

Yes, this inconsistency caused me a lot of grief while working on this
bug.

> So, it should be removed and replaced by calls to either
> make_unibyte_string or make_multibyte_string.

I presume you think the same about build_string, then.

By sheer luck (or maybe something else), this is exactly what I've
been doing in every case where it mattered for the non-ASCII build.

(The job is almost done, btw, I'm in final testing.)





* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-31  3:47                                             ` Eli Zaretskii
@ 2013-10-31 13:40                                               ` Stefan Monnier
  2013-10-31 16:25                                                 ` Eli Zaretskii
  2013-10-31 17:59                                               ` Eli Zaretskii
  1 sibling, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2013-10-31 13:40 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 15260

>> So, it should be removed and replaced by calls to either
>> make_unibyte_string or make_multibyte_string.
> I presume you think the same about build_string, then.

Pretty much, except that I tend to think of build_string as only meant
for use on string constants, which are all ASCII, normally, so it
doesn't matter nearly as much.


        Stefan





* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-31 13:40                                               ` Stefan Monnier
@ 2013-10-31 16:25                                                 ` Eli Zaretskii
  2013-10-31 18:04                                                   ` Stefan Monnier
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-31 16:25 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 15260

> From: Stefan Monnier <monnier@IRO.UMontreal.CA>
> Cc: handa@gnu.org, 15260@debbugs.gnu.org
> Date: Thu, 31 Oct 2013 09:40:07 -0400
> 
> >> So, it should be removed and replaced by calls to either
> >> make_unibyte_string or make_multibyte_string.
> > I presume you think the same about build_string, then.
> 
> Pretty much, except that I tend to think of build_string as only meant
> for use on string constants, which are all ASCII, normally, so it
> doesn't matter nearly as much.

About 20% of uses of build_string are not guaranteed to act on pure
ASCII strings.





* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-31 16:25                                                 ` Eli Zaretskii
@ 2013-10-31 18:04                                                   ` Stefan Monnier
  0 siblings, 0 replies; 50+ messages in thread
From: Stefan Monnier @ 2013-10-31 18:04 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 15260

>> >> So, it should be removed and replaced by calls to either
>> >> make_unibyte_string or make_multibyte_string.
>> > I presume you think the same about build_string, then.
>> Pretty much, except that I tend to think of build_string as only meant
>> for use on string constants, which are all ASCII, normally, so it
>> doesn't matter nearly as much.
> About 20% of uses of build_string are not guaranteed to act on pure
> ASCII strings.

These should presumably use something like make_unibyte_string or
make_multibyte_string, then.


        Stefan





* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-31  3:47                                             ` Eli Zaretskii
  2013-10-31 13:40                                               ` Stefan Monnier
@ 2013-10-31 17:59                                               ` Eli Zaretskii
  2013-10-31 19:24                                                 ` Stefan Monnier
  2013-11-04 17:35                                                 ` Eli Zaretskii
  1 sibling, 2 replies; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-31 17:59 UTC (permalink / raw)
  To: Stefan Monnier, Glenn Morris; +Cc: 15260

> Date: Thu, 31 Oct 2013 05:47:48 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 15260@debbugs.gnu.org
> 
> (The job is almost done, btw, I'm in final testing.)

Below is what I came up with.  This survived several bootstraps, both
on GNU/Linux (in- and out-of-source-tree builds) and on MS-Windows,
including "make install" into a non-ASCII directory and invocation
from there.

To summarize the changes:

  . ENCODE_FILE now explicitly leaves alone unibyte strings.  (It
    could be that it did this before as well, but I couldn't find the
    code which had that effect, and doing that early on is TRT
    anyway.)

  . All the *-directory and *-path variables we create at startup are
    forced to be unibyte strings.  (Previously, they were sometimes
    unibyte and sometimes multibyte.)  In temacs that dumps or
    bootstraps, they stay unibyte all the way till program exit.  In a
    session that runs startup.el, they are decoded early in
    normal-top-level, after setting up the locale environment.  The
    code which decodes these was moved much closer to the beginning of
    normal-top-level, as its previous place was too late, after some
    damage was already done.

  . Several left-overs from working around problems that are long gone
    are removed.  Notable examples: (a) "set LC_ALL=C" when running
    'emacs' (NOT 'temacs') in lisp/ and leim/; (b) storing (in
    init_buffer) the default-directory of *scratch* in the multibyte
    representation of the original unibyte bytes.

  . A few related bugs which got in the way were fixed.  E.g., in
    make_temp_name.

Please test this, as I could only do that in 2 different locales.

Please pay specific attention to strings in load-path,
default-directory, data-directory, etc., after Emacs comes up in an
interactive session: they should all be multibyte strings (you can use
multibyte-string-p to test that).

If no problems pop up, I will commit this in a few days.

Thanks.

=== modified file 'configure.ac'
--- configure.ac	2013-10-27 18:57:20 +0000
+++ configure.ac	2013-10-31 16:57:18 +0000
@@ -73,30 +73,6 @@ dnl Support for --program-prefix, --prog
 dnl --program-transform-name options
 AC_ARG_PROGRAM
 
-dnl http://debbugs.gnu.org/15260
-dnl I think we have to check, eg, both exec_prefix and bindir,
-dnl because the latter by default is not yet expanded, but the user
-dnl may have specified a value for it via --bindir.
-dnl At first glance, _installing_ in non-ASCII seems ok, but in fact
-dnl it is not; see http://debbugs.gnu.org/15260#61
-dnl Note that abs_srcdir and abs_builddir are not yet defined. :(
-
-dnl "`cd \"$srcdir\"`" is not portable.
-dnl See autoconf manual "Shell Substitutions":
-dnl "There is just no portable way to use double-quoted strings inside
-dnl double-quoted back-quoted expressions (pfew!)."
-temp_srcdir=`cd "$srcdir"; pwd`
-
-for var in "`pwd`" "$temp_srcdir" "$prefix" "$exec_prefix" \
-    "$datarootdir" "$bindir" "$datadir" "$sharedstatedir" "$libexecdir"; do
-
-  dnl configure sets LC_ALL=C early on, so this range should work.
-  case "$var" in
-    *[[^\ -~]]*) AC_MSG_ERROR([Emacs cannot be built or installed in a directory whose name contains non-ASCII characters: $var]) ;;
-  esac
-
-done
-
 dnl It is important that variables on the RHS not be expanded here,
 dnl hence the single quotes.  This is per the GNU coding standards, see
 dnl (autoconf) Installation Directory Variables

=== modified file 'leim/Makefile.in'
--- leim/Makefile.in	2013-10-24 02:29:29 +0000
+++ leim/Makefile.in	2013-10-31 16:57:18 +0000
@@ -34,7 +34,7 @@ EMACS = ../src/emacs
 buildlisppath=${abs_srcdir}/../lisp
 
 # How to run Emacs.
-RUN_EMACS = EMACSLOADPATH="$(buildlisppath)" LC_ALL=C \
+RUN_EMACS = EMACSLOADPATH="$(buildlisppath)" \
 	"${EMACS}" -batch --no-site-file --no-site-lisp
 
 MKDIR_P = @MKDIR_P@

=== modified file 'lisp/Makefile.in'
--- lisp/Makefile.in	2013-10-31 07:27:35 +0000
+++ lisp/Makefile.in	2013-10-31 16:57:18 +0000
@@ -115,7 +115,7 @@ COMPILE_FIRST = \
 
 # The actual Emacs command run in the targets below.
 
-emacs = EMACSLOADPATH="$(abs_lisp)" LC_ALL=C "$(EMACS)" $(EMACSOPT)
+emacs = EMACSLOADPATH="$(abs_lisp)" "$(EMACS)" $(EMACSOPT)
 
 # Common command to find subdirectories
 setwins=subdirs=`find . -type d -print`; \

=== modified file 'lisp/loadup.el'
--- lisp/loadup.el	2013-10-08 15:11:29 +0000
+++ lisp/loadup.el	2013-10-31 16:57:18 +0000
@@ -286,6 +286,20 @@
 ;For other systems, you must edit ../src/Makefile.in.
 (load "site-load" t)
 
+;; Make sure default-directory is unibyte when dumping.  This is
+;; because we cannot decode and encode it correctly (since the locale
+;; environment is not, and should not be, set up).  default-directory
+;; is used every time we call expand-file-name, which we do in every
+;; file primitive.  So the only workable solution to support building
+;; in non-ASCII directories is to manipulate unibyte strings in the
+;; current locale's encoding.
+(if (and (or (equal (nth 3 command-line-args) "dump")
+	     (equal (nth 4 command-line-args) "dump")
+	     (equal (nth 3 command-line-args) "bootstrap")
+	     (equal (nth 4 command-line-args) "bootstrap"))
+	 (multibyte-string-p default-directory))
+    (setq default-directory (string-to-unibyte default-directory)))
+
 ;; Determine which last version number to use
 ;; based on the executables that now exist.
 (if (and (or (equal (nth 3 command-line-args) "dump")

=== modified file 'lisp/startup.el'
--- lisp/startup.el	2013-10-30 02:45:53 +0000
+++ lisp/startup.el	2013-10-31 17:04:11 +0000
@@ -489,6 +489,63 @@ It is the default value of the variable 
   (if command-line-processed
       (message "Back to top level.")
     (setq command-line-processed t)
+
+    ;; Set the default strings to display in mode line for end-of-line
+    ;; formats that aren't native to this platform.  This should be
+    ;; done before calling set-locale-environment, as the latter might
+    ;; use these mnemonics.
+    (cond
+     ((memq system-type '(ms-dos windows-nt))
+      (setq eol-mnemonic-unix "(Unix)"
+	    eol-mnemonic-mac  "(Mac)"))
+     (t                                   ; this is for Unix/GNU/Linux systems
+      (setq eol-mnemonic-dos  "(DOS)"
+	    eol-mnemonic-mac  "(Mac)")))
+
+    (set-locale-environment nil)
+    ;; Decode all default-directory's (probably, only *scratch* exists
+    ;; at this point).  default-directory of *scratch* is the basis
+    ;; for many other file-name variables and directory lists, so it
+    ;; is important to decode it ASAP.
+    (when locale-coding-system
+      (save-excursion
+	(dolist (elt (buffer-list))
+	  (set-buffer elt)
+	  (if default-directory
+	      (setq default-directory
+		    (decode-coding-string default-directory
+					  locale-coding-system t)))))
+
+      ;; Decode all the important variables and directory lists, now
+      ;; that we know the locale's encoding.  This is because the
+      ;; values of these variables are until here unibyte undecoded
+      ;; strings created by build_unibyte_string.  data-directory in
+      ;; particular is used to construct many other standard directory
+      ;; names, so it must be decoded ASAP.
+      ;; Note that charset-map-path cannot be decoded here, since we
+      ;; could then be trapped in infinite recursion below, when we
+      ;; load subdirs.el, because encoding a directory name might need
+      ;; to load a charset map, which will want to encode
+      ;; charset-map-path, which will want to load the same charset
+      ;; map...  So decoding of charset-map-path is delayed until
+      ;; further down below.
+      (dolist (pathsym '(load-path exec-path))
+	(let ((path (symbol-value pathsym)))
+	  (if (listp path)
+	      (set pathsym (mapcar (lambda (dir)
+				     (decode-coding-string
+				      dir
+				      locale-coding-system t))
+				path)))))
+      (dolist (filesym '(data-directory doc-directory exec-directory
+					installation-directory
+					invocation-directory invocation-name
+					source-directory
+					shared-game-score-directory))
+	(let ((file (symbol-value filesym)))
+	  (if (stringp file)
+	      (set filesym (decode-coding-string file locale-coding-system t))))))
+
     (let ((dir default-directory))
       (with-current-buffer "*Messages*"
         (messages-buffer-mode)
@@ -536,6 +593,16 @@ It is the default value of the variable 
 	       (setq process-environment
 		     (delete (concat "PWD=" pwd)
 			     process-environment)))))
+    ;; Now, that other directories were searched, and any charsets we
+    ;; need for encoding them are already loaded, we are ready to
+    ;; decode charset-map-path.
+    (if (listp charset-map-path)
+	(setq charset-map-path
+	      (mapcar (lambda (dir)
+			(decode-coding-string
+			 dir
+			 locale-coding-system t))
+		      charset-map-path)))
     (setq default-directory (abbreviate-file-name default-directory))
     (let ((old-face-font-rescale-alist face-font-rescale-alist))
       (unwind-protect
@@ -756,18 +823,6 @@ Amongst another things, it parses the co
   ;;! ;; Choose a good default value for split-window-keep-point.
   ;;! (setq split-window-keep-point (> baud-rate 2400))
 
-  ;; Set the default strings to display in mode line for
-  ;; end-of-line formats that aren't native to this platform.
-  (cond
-   ((memq system-type '(ms-dos windows-nt))
-    (setq eol-mnemonic-unix "(Unix)"
-          eol-mnemonic-mac  "(Mac)"))
-   (t                                   ; this is for Unix/GNU/Linux systems
-    (setq eol-mnemonic-dos  "(DOS)"
-          eol-mnemonic-mac  "(Mac)")))
-
-  (set-locale-environment nil)
-
   ;; Convert preloaded file names in load-history to absolute.
   (let ((simple-file-name
 	 ;; Look for simple.el or simple.elc and use their directory
@@ -801,7 +856,7 @@ please check its value")
 		    load-history))))
 
   ;; Convert the arguments to Emacs internal representation.
-  (let ((args (cdr command-line-args)))
+  (let ((args command-line-args))
     (while args
       (setcar args
 	      (decode-coding-string (car args) locale-coding-system t))
@@ -1211,19 +1266,6 @@ the `--debug-init' option to view a comp
   (setq after-init-time (current-time))
   (run-hooks 'after-init-hook)
 
-  ;; Decode all default-directory.
-  (if (and (default-value 'enable-multibyte-characters) locale-coding-system)
-      (save-excursion
-	(dolist (elt (buffer-list))
-	  (set-buffer elt)
-	  (if default-directory
-	      (setq default-directory
-		    (decode-coding-string default-directory
-					  locale-coding-system t))))
-	(setq command-line-default-directory
-	      (decode-coding-string command-line-default-directory
-				    locale-coding-system t))))
-
   ;; If *scratch* exists and init file didn't change its mode, initialize it.
   (if (get-buffer "*scratch*")
       (with-current-buffer "*scratch*"

=== modified file 'src/buffer.c'
--- src/buffer.c	2013-10-29 14:46:23 +0000
+++ src/buffer.c	2013-10-31 16:57:18 +0000
@@ -5349,13 +5349,10 @@ init_buffer (void)
       len++;
     }
 
+  /* At this moment, we still don't know how to decode the directory
+     name.  So, we keep the bytes in unibyte form so that file I/O
+     routines correctly get the original bytes.  */
   bset_directory (current_buffer, make_unibyte_string (pwd, len));
-  if (! NILP (BVAR (&buffer_defaults, enable_multibyte_characters)))
-    /* At this moment, we still don't know how to decode the
-       directory name.  So, we keep the bytes in multibyte form so
-       that ENCODE_FILE correctly gets the original bytes.  */
-    bset_directory
-      (current_buffer, string_to_multibyte (BVAR (current_buffer, directory)));
 
   /* Add /: to the front of the name
      if it would otherwise be treated as magic.  */

=== modified file 'src/callproc.c'
--- src/callproc.c	2013-08-23 17:57:07 +0000
+++ src/callproc.c	2013-10-31 16:57:18 +0000
@@ -1612,14 +1612,14 @@ init_callproc (void)
       Lisp_Object tem, tem1, srcdir;
 
       srcdir = Fexpand_file_name (build_string ("../src/"),
-				  build_string (PATH_DUMPLOADSEARCH));
+				  build_unibyte_string (PATH_DUMPLOADSEARCH));
       tem = Fexpand_file_name (build_string ("GNU"), Vdata_directory);
       tem1 = Ffile_exists_p (tem);
       if (!NILP (Fequal (srcdir, Vinvocation_directory)) || NILP (tem1))
 	{
 	  Lisp_Object newdir;
 	  newdir = Fexpand_file_name (build_string ("../etc/"),
-				      build_string (PATH_DUMPLOADSEARCH));
+				      build_unibyte_string (PATH_DUMPLOADSEARCH));
 	  tem = Fexpand_file_name (build_string ("GNU"), newdir);
 	  tem1 = Ffile_exists_p (tem);
 	  if (!NILP (tem1))
@@ -1646,7 +1646,7 @@ init_callproc (void)
 #ifdef DOS_NT
   Vshared_game_score_directory = Qnil;
 #else
-  Vshared_game_score_directory = build_string (PATH_GAME);
+  Vshared_game_score_directory = build_unibyte_string (PATH_GAME);
   if (NILP (Ffile_accessible_directory_p (Vshared_game_score_directory)))
     Vshared_game_score_directory = Qnil;
 #endif

=== modified file 'src/coding.h'
--- src/coding.h	2013-10-08 06:40:09 +0000
+++ src/coding.h	2013-10-31 16:57:18 +0000
@@ -670,14 +670,16 @@ struct coding_system
     (code) = (s1 << 8) | s2;				\
   } while (0)
 
-/* Encode the file name NAME using the specified coding system
-   for file names, if any.  */
-#define ENCODE_FILE(name)						   \
-  (! NILP (Vfile_name_coding_system)					   \
-   ? code_convert_string_norecord (name, Vfile_name_coding_system, 1)	   \
-   : (! NILP (Vdefault_file_name_coding_system)				   \
-      ? code_convert_string_norecord (name, Vdefault_file_name_coding_system, 1) \
-      : name))
+/* Encode the file name NAME using the specified coding system for
+   file names, if any.  If NAME is a unibyte string, return NAME.  */
+#define ENCODE_FILE(name)						\
+    (! STRING_MULTIBYTE (name)						\
+     ? name								\
+     : (! NILP (Vfile_name_coding_system)				\
+	? code_convert_string_norecord (name, Vfile_name_coding_system, 1) \
+	: (! NILP (Vdefault_file_name_coding_system)			\
+	   ? code_convert_string_norecord (name, Vdefault_file_name_coding_system, 1) \
+	   : name)))
 
 
 /* Decode the file name NAME using the specified coding system

=== modified file 'src/emacs.c'
--- src/emacs.c	2013-10-31 08:32:42 +0000
+++ src/emacs.c	2013-10-31 17:03:39 +0000
@@ -396,7 +396,7 @@ init_cmdargs (int argc, char **argv, int
   initial_argv = argv;
   initial_argc = argc;
 
-  raw_name = build_string (argv[0]);
+  raw_name = build_unibyte_string (argv[0]);
 
   /* Add /: to the front of the name
      if it would otherwise be treated as magic.  */
@@ -430,7 +430,9 @@ init_cmdargs (int argc, char **argv, int
     /* Emacs was started with relative path, like ./emacs.
        Make it absolute.  */
     {
-      Lisp_Object odir = original_pwd ? build_string (original_pwd) : Qnil;
+      Lisp_Object odir =
+	original_pwd ? build_unibyte_string (original_pwd) : Qnil;
+
       Vinvocation_directory = Fexpand_file_name (Vinvocation_directory, odir);
     }
 
@@ -2204,7 +2206,7 @@ decode_env_path (const char *evarname, c
       p = strchr (path, SEPCHAR);
       if (!p)
 	p = path + strlen (path);
-      element = (p - path ? make_string (path, p - path)
+      element = (p - path ? make_unibyte_string (path, p - path)
 		 : build_string ("."));
 #ifdef WINDOWSNT
       /* Relative file names in the default path are interpreted as
@@ -2214,7 +2216,7 @@ decode_env_path (const char *evarname, c
 	element = Fexpand_file_name (Fsubstring (element,
 						 make_number (emacs_dir_len),
 						 Qnil),
-				     build_string (emacs_dir));
+				     build_unibyte_string (emacs_dir));
 #endif
 
       /* Add /: to the front of the name

=== modified file 'src/fileio.c'
--- src/fileio.c	2013-10-17 06:42:21 +0000
+++ src/fileio.c	2013-10-31 16:57:18 +0000
@@ -732,8 +732,8 @@ static unsigned make_temp_name_count, ma
 Lisp_Object
 make_temp_name (Lisp_Object prefix, bool base64_p)
 {
-  Lisp_Object val;
-  int len, clen;
+  Lisp_Object val, encoded_prefix;
+  int len;
   printmax_t pid;
   char *p, *data;
   char pidbuf[INT_BUFSIZE_BOUND (printmax_t)];
@@ -767,12 +767,11 @@ make_temp_name (Lisp_Object prefix, bool
 #endif
     }
 
-  len = SBYTES (prefix); clen = SCHARS (prefix);
-  val = make_uninit_multibyte_string (clen + 3 + pidlen, len + 3 + pidlen);
-  if (!STRING_MULTIBYTE (prefix))
-    STRING_SET_UNIBYTE (val);
+  encoded_prefix = ENCODE_FILE (prefix);
+  len = SBYTES (encoded_prefix);
+  val = make_uninit_string (len + 3 + pidlen);
   data = SSDATA (val);
-  memcpy (data, SSDATA (prefix), len);
+  memcpy (data, SSDATA (encoded_prefix), len);
   p = data + len;
 
   memcpy (p, pidbuf, pidlen);
@@ -810,7 +809,7 @@ make_temp_name (Lisp_Object prefix, bool
 	{
 	  /* We want to return only if errno is ENOENT.  */
 	  if (errno == ENOENT)
-	    return val;
+	    return DECODE_FILE (val);
 	  else
 	    /* The error here is dubious, but there is little else we
 	       can do.  The alternatives are to return nil, which is
@@ -987,7 +986,26 @@ filesystem tree, not (expand-file-name "
   if (multibyte != STRING_MULTIBYTE (default_directory))
     {
       if (multibyte)
-	default_directory = string_to_multibyte (default_directory);
+	{
+	  unsigned char *p = SDATA (name);
+
+	  while (*p && ASCII_BYTE_P (*p))
+	    p++;
+	  if (*p == '\0')
+	    {
+	      /* NAME is a pure ASCII string, and DEFAULT_DIRECTORY is
+		 unibyte.  Do not convert DEFAULT_DIRECTORY to
+		 multibyte; instead, convert NAME to a unibyte string,
+		 so that the result of this function is also a unibyte
+		 string.  This is needed during bootstrapping and
+		 dumping, when Emacs cannot decode file names, because
+		 the locale environment is not set up.  */
+	      name = make_unibyte_string (SSDATA (name), SBYTES (name));
+	      multibyte = 0;
+	    }
+	  else
+	    default_directory = string_to_multibyte (default_directory);
+	}
       else
 	{
 	  name = string_to_multibyte (name);

=== modified file 'src/lread.c'
--- src/lread.c	2013-09-26 03:46:47 +0000
+++ src/lread.c	2013-10-31 16:57:18 +0000
@@ -1500,7 +1500,8 @@ openp (Lisp_Object path, Lisp_Object str
       for (tail = NILP (suffixes) ? list1 (empty_unibyte_string) : suffixes;
 	   CONSP (tail); tail = XCDR (tail))
 	{
-	  ptrdiff_t fnlen, lsuffix = SBYTES (XCAR (tail));
+	  Lisp_Object suffix = XCAR (tail);
+	  ptrdiff_t fnlen, lsuffix = SBYTES (suffix);
 	  Lisp_Object handler;
 
 	  /* Concatenate path element/specified name with the suffix.
@@ -1511,7 +1512,7 @@ openp (Lisp_Object path, Lisp_Object str
 			   ? 2 : 0);
 	  fnlen = SBYTES (filename) - prefixlen;
 	  memcpy (fn, SDATA (filename) + prefixlen, fnlen);
-	  memcpy (fn + fnlen, SDATA (XCAR (tail)), lsuffix + 1);
+	  memcpy (fn + fnlen, SDATA (suffix), lsuffix + 1);
 	  fnlen += lsuffix;
 	  /* Check that the file exists and is not a directory.  */
 	  /* We used to only check for handlers on non-absolute file names:
@@ -1521,7 +1522,18 @@ openp (Lisp_Object path, Lisp_Object str
 		  handler = Ffind_file_name_handler (filename, Qfile_exists_p);
 	     It's not clear why that was the case and it breaks things like
 	     (load "/bar.el") where the file is actually "/bar.el.gz".  */
-	  string = make_string (fn, fnlen);
+	  /* make_string has its own ideas on when to return a unibyte
+	     string and when a multibyte string, but we know better.
+	     We must have a unibyte string when dumping, since
+	     file-name encoding is shaky at best at that time, and in
+	     particular default-file-name-coding-system is reset
+	     several times during loadup.  We therefore don't want to
+	     encode the file before passing it to file I/O library
+	     functions.  */
+	  if (!STRING_MULTIBYTE (filename) && !STRING_MULTIBYTE (suffix))
+	    string = make_unibyte_string (fn, fnlen);
+	  else
+	    string = make_string (fn, fnlen);
 	  handler = Ffind_file_name_handler (string, Qfile_exists_p);
 	  if ((!NILP (handler) || !NILP (predicate)) && !NATNUMP (predicate))
             {

=== modified file 'src/xdisp.c'
--- src/xdisp.c	2013-10-29 16:11:50 +0000
+++ src/xdisp.c	2013-10-31 16:57:18 +0000
@@ -9728,7 +9728,11 @@ message3_nolog (Lisp_Object m)
 	putc ('\n', stderr);
       noninteractive_need_newline = 0;
       if (STRINGP (m))
-	fwrite (SDATA (m), SBYTES (m), 1, stderr);
+	{
+	  Lisp_Object s = ENCODE_SYSTEM (m);
+
+	  fwrite (SDATA (s), SBYTES (s), 1, stderr);
+	}
       if (cursor_in_echo_area == 0)
 	fprintf (stderr, "\n");
       fflush (stderr);
@@ -9803,13 +9807,19 @@ message_with_string (const char *m, Lisp
     {
       if (m)
 	{
+	  /* ENCODE_SYSTEM below can GC and/or relocate the Lisp
+	     String whose data pointer might be passed to us in M.  So
+	     we use a local copy.  */
+	  char *fmt = xstrdup (m);
+
 	  if (noninteractive_need_newline)
 	    putc ('\n', stderr);
 	  noninteractive_need_newline = 0;
-	  fprintf (stderr, m, SDATA (string));
+	  fprintf (stderr, fmt, SDATA (ENCODE_SYSTEM (string)));
 	  if (!cursor_in_echo_area)
 	    fprintf (stderr, "\n");
 	  fflush (stderr);
+	  xfree (fmt);
 	}
     }
   else if (INTERACTIVE)
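
For readers following the fileio.c hunk above, the decision it adds to
expand-file-name can be sketched in isolation.  What follows is a toy
model in plain C, not Emacs code: struct toy_string and both functions
are hypothetical names invented for illustration, and the "name is
unibyte, directory is multibyte" branch assumes the result becomes
multibyte, as the context lines of the hunk suggest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a Lisp string: raw bytes plus a flag.  */
struct toy_string
{
  const unsigned char *data;
  size_t nbytes;
  bool multibyte;
};

/* Return true if none of the N bytes at P has its high bit set.  */
static bool
toy_pure_ascii (const unsigned char *p, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (p[i] >= 0x80)
      return false;
  return true;
}

/* Return true if combining NAME with DIR should yield a multibyte
   result, false if the result should stay unibyte.  */
static bool
toy_result_is_multibyte (const struct toy_string *name,
			 const struct toy_string *dir)
{
  assert (name != NULL && dir != NULL);

  /* Same representation on both sides: nothing to reconcile.  */
  if (name->multibyte == dir->multibyte)
    return name->multibyte;

  /* NAME is multibyte and DIR is unibyte.  If NAME is pure ASCII it
     can be treated as unibyte without losing anything, so the raw
     bytes of DIR survive untouched.  */
  if (name->multibyte)
    return !toy_pure_ascii (name->data, name->nbytes);

  /* NAME is unibyte and DIR is multibyte: NAME is widened.  */
  return true;
}
```

With a unibyte directory such as "/tmp/\xc3\xa9/", a multibyte but
pure-ASCII name like "foo.el" gives a unibyte result, while a multibyte
name that itself contains non-ASCII bytes forces the multibyte path.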






^ permalink raw reply	[flat|nested] 50+ messages in thread

* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-31 17:59                                               ` Eli Zaretskii
@ 2013-10-31 19:24                                                 ` Stefan Monnier
  2013-10-31 19:33                                                   ` Eli Zaretskii
  2013-11-04 17:35                                                 ` Eli Zaretskii
  1 sibling, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2013-10-31 19:24 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 15260

> Below is what I came up with.  This survived several bootstraps, both

Thanks, Eli.

> +;; Make sure default-directory is unibyte when dumping.  This is
> +;; because we cannot decode and encode it correctly (since the locale
> +;; environment is not, and should not be, set up).  default-directory
> +;; is used every time we call expand-file-name, which we do in every
> +;; file primitive.  So the only workable solution to support building
> +;; in non-ASCII directories is to manipulate unibyte strings in the
> +;; current locale's encoding.
> +(if (and (or (equal (nth 3 command-line-args) "dump")
> +	     (equal (nth 4 command-line-args) "dump")
> +	     (equal (nth 3 command-line-args) "bootstrap")
> +	     (equal (nth 4 command-line-args) "bootstrap"))
> +	 (multibyte-string-p default-directory))
> +    (setq default-directory (string-to-unibyte default-directory)))

I'm not sure I understand this string-to-unibyte.
This call seems to only be correct if default-directory holds the
undecoded but multibyte name.
Why would we have an undecoded yet multibyte name?

IOW, I'd expect here to either have default-directory be unibyte
already, or be multibyte but encoded in some (arbitrary) encoding (in
which case we can't really know how to re-encode it).


        Stefan





^ permalink raw reply	[flat|nested] 50+ messages in thread

* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-31 19:24                                                 ` Stefan Monnier
@ 2013-10-31 19:33                                                   ` Eli Zaretskii
  2013-11-01  9:27                                                     ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-31 19:33 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 15260

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Glenn Morris <rgm@gnu.org>,  15260@debbugs.gnu.org,  Kenichi Handa <handa@gnu.org>
> Date: Thu, 31 Oct 2013 15:24:57 -0400
> 
> > Below is what I came up with.  This survived several bootstraps, both
> 
> Thanks, Eli.
> 
> > +;; Make sure default-directory is unibyte when dumping.  This is
> > +;; because we cannot decode and encode it correctly (since the locale
> > +;; environment is not, and should not be, set up).  default-directory
> > +;; is used every time we call expand-file-name, which we do in every
> > +;; file primitive.  So the only workable solution to support building
> > +;; in non-ASCII directories is to manipulate unibyte strings in the
> > +;; current locale's encoding.
> > +(if (and (or (equal (nth 3 command-line-args) "dump")
> > +	     (equal (nth 4 command-line-args) "dump")
> > +	     (equal (nth 3 command-line-args) "bootstrap")
> > +	     (equal (nth 4 command-line-args) "bootstrap"))
> > +	 (multibyte-string-p default-directory))
> > +    (setq default-directory (string-to-unibyte default-directory)))
> 
> I'm not sure I understand this string-to-unibyte.
> This call seems to only be correct if default-directory holds the
> undecoded but multibyte name.
> Why would we have an undecoded yet multibyte name?

This was a necessity before I removed this quirk from init_buffer:

--- src/buffer.c	2013-10-29 14:46:23 +0000
+++ src/buffer.c	2013-10-31 16:57:18 +0000
@@ -5349,13 +5349,10 @@ init_buffer (void)
       len++;
     }
 
+  /* At this moment, we still don't know how to decode the directory
+     name.  So, we keep the bytes in unibyte form so that file I/O
+     routines correctly get the original bytes.  */
   bset_directory (current_buffer, make_unibyte_string (pwd, len));
-  if (! NILP (BVAR (&buffer_defaults, enable_multibyte_characters)))
-    /* At this moment, we still don't know how to decode the
-       directory name.  So, we keep the bytes in multibyte form so
-       that ENCODE_FILE correctly gets the original bytes.  */
-    bset_directory
-      (current_buffer, string_to_multibyte (BVAR (current_buffer, directory)));
 
   /* Add /: to the front of the name
      if it would otherwise be treated as magic.  */

After removing that, it's probably not needed anymore, since now
default-directory should be a unibyte string from the very beginning.
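
The reason raw bytes are safe here can be shown outside Emacs.  The
sketch below assumes a POSIX system; the two function names are
hypothetical and nothing in it is taken from the Emacs sources.  It
hands the bytes returned by getcwd straight back to stat, with no
decoding or encoding step in between, which is what keeping
default-directory unibyte amounts to.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/* Fill BUF (of size N) with the current directory exactly as the
   kernel reports it.  Return true on success.  */
static bool
toy_current_dir_bytes (char *buf, size_t n)
{
  assert (buf != NULL && n > 0);
  return getcwd (buf, n) != NULL;
}

/* Return true if PATH, passed to the file system byte for byte,
   names an existing directory.  No locale or coding system is
   consulted anywhere on this path.  */
static bool
toy_raw_bytes_name_directory (const char *path)
{
  struct stat st;

  assert (path != NULL);
  return stat (path, &st) == 0 && S_ISDIR (st.st_mode);
}
```

Calling toy_current_dir_bytes and then toy_raw_bytes_name_directory on
the same buffer succeeds whatever the directory is called, because the
bytes are never reinterpreted.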





^ permalink raw reply	[flat|nested] 50+ messages in thread

* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-31 19:33                                                   ` Eli Zaretskii
@ 2013-11-01  9:27                                                     ` Eli Zaretskii
  2013-11-01 12:33                                                       ` Stefan Monnier
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2013-11-01  9:27 UTC (permalink / raw)
  To: monnier; +Cc: 15260

> Date: Thu, 31 Oct 2013 21:33:22 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 15260@debbugs.gnu.org
> 
> > > +;; Make sure default-directory is unibyte when dumping.  This is
> > > +;; because we cannot decode and encode it correctly (since the locale
> > > +;; environment is not, and should not be, set up).  default-directory
> > > +;; is used every time we call expand-file-name, which we do in every
> > > +;; file primitive.  So the only workable solution to support building
> > > +;; in non-ASCII directories is to manipulate unibyte strings in the
> > > +;; current locale's encoding.
> > > +(if (and (or (equal (nth 3 command-line-args) "dump")
> > > +	     (equal (nth 4 command-line-args) "dump")
> > > +	     (equal (nth 3 command-line-args) "bootstrap")
> > > +	     (equal (nth 4 command-line-args) "bootstrap"))
> > > +	 (multibyte-string-p default-directory))
> > > +    (setq default-directory (string-to-unibyte default-directory)))
> > 
> > I'm not sure I understand this string-to-unibyte.
> > This call seems to only be correct if default-directory holds the
> > undecoded but multibyte name.
> > Why would we have an undecoded yet multibyte name?
> 
> This was a necessity before I removed this quirk from init_buffer:
> 
> --- src/buffer.c	2013-10-29 14:46:23 +0000
> +++ src/buffer.c	2013-10-31 16:57:18 +0000
> @@ -5349,13 +5349,10 @@ init_buffer (void)
>        len++;
>      }
>  
> +  /* At this moment, we still don't know how to decode the directory
> +     name.  So, we keep the bytes in unibyte form so that file I/O
> +     routines correctly get the original bytes.  */
>    bset_directory (current_buffer, make_unibyte_string (pwd, len));
> -  if (! NILP (BVAR (&buffer_defaults, enable_multibyte_characters)))
> -    /* At this moment, we still don't know how to decode the
> -       directory name.  So, we keep the bytes in multibyte form so
> -       that ENCODE_FILE correctly gets the original bytes.  */
> -    bset_directory
> -      (current_buffer, string_to_multibyte (BVAR (current_buffer, directory)));
>  
>    /* Add /: to the front of the name
>       if it would otherwise be treated as magic.  */
> 
> After removing that, it's probably not needed anymore, since now
> default-directory should be a unibyte string from the very beginning.

Would you prefer that we error out if default-directory is not a
unibyte string at that point in loadup.el?





^ permalink raw reply	[flat|nested] 50+ messages in thread

* bug#15260: cannot build in a directory with non-ascii characters
  2013-11-01  9:27                                                     ` Eli Zaretskii
@ 2013-11-01 12:33                                                       ` Stefan Monnier
  2013-11-04 17:37                                                         ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2013-11-01 12:33 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 15260

> Would you prefer that we error out if default-directory is not a
> unibyte string at that point in loadup.el?

I'd prefer to either not do anything, or issue a warning, or error
out, yes.


        Stefan





^ permalink raw reply	[flat|nested] 50+ messages in thread

* bug#15260: cannot build in a directory with non-ascii characters
  2013-11-01 12:33                                                       ` Stefan Monnier
@ 2013-11-04 17:37                                                         ` Eli Zaretskii
  0 siblings, 0 replies; 50+ messages in thread
From: Eli Zaretskii @ 2013-11-04 17:37 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 15260

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: 15260@debbugs.gnu.org
> Date: Fri, 01 Nov 2013 08:33:04 -0400
> 
> > Would you prefer that we error out if default-directory is not a
> > unibyte string at that point in loadup.el?
> 
> I'd prefer to either not do anything, or issue a warning, or error
> out, yes.

I eventually opted for erroring out, mostly to be able to catch any
unforeseen problems and use cases I missed.





^ permalink raw reply	[flat|nested] 50+ messages in thread

* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-31 17:59                                               ` Eli Zaretskii
  2013-10-31 19:24                                                 ` Stefan Monnier
@ 2013-11-04 17:35                                                 ` Eli Zaretskii
  2013-11-04 18:38                                                   ` Stefan Monnier
  1 sibling, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2013-11-04 17:35 UTC (permalink / raw)
  To: monnier, rgm; +Cc: 15260-done

> Date: Thu, 31 Oct 2013 19:59:52 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 15260@debbugs.gnu.org
> 
> If no problems pop up, I will commit this in a few days.

No further comments, and I got fed up with resolving merge conflicts
every day, so I committed the changes, and I'm marking this bug done.

Thanks to everybody for their feedback and support.





^ permalink raw reply	[flat|nested] 50+ messages in thread

* bug#15260: cannot build in a directory with non-ascii characters
  2013-11-04 17:35                                                 ` Eli Zaretskii
@ 2013-11-04 18:38                                                   ` Stefan Monnier
  0 siblings, 0 replies; 50+ messages in thread
From: Stefan Monnier @ 2013-11-04 18:38 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 15260-done

> Thanks to everybody for their feedback and support.

Thank you, Eli,


        Stefan





^ permalink raw reply	[flat|nested] 50+ messages in thread

* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-31  1:01                                           ` Stefan Monnier
  2013-10-31  3:47                                             ` Eli Zaretskii
@ 2013-10-31 17:16                                             ` Eli Zaretskii
  2013-10-31 18:09                                               ` Stefan Monnier
  1 sibling, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-31 17:16 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 15260

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Kenichi Handa <handa@gnu.org>,  15260@debbugs.gnu.org
> Date: Wed, 30 Oct 2013 21:01:21 -0400
> 
> > Why do we have this confusing inconsistency?
> 
> make_string is a bug.  There's no way to know/guess if the string should
> be unibyte or multibyte.  So, it should be removed and replaced by calls
> to either make_unibyte_string or make_multibyte_string.

Here's one more gotcha I bumped into while working on this bug.

Suppose the filesystem where you build Emacs uses a file-name encoding
whose coding-system-category is 'charset'.  Example: cpNNNN.  Then,
when Emacs comes up after dumping, it loads subdirs.el in each
directory on load-path.  To do this, it calls 'openp' to look for
DIR/subdirs.el, which involves calling ENCODE_FILE on
"DIR/subdirs.el", in order to pass that to 'faccessat' or 'open'.
Now, if the charset that is needed to encode this file name is not yet
loaded into Emacs, Emacs will try to load it.  To this end, it will
look along charset-map-path for the corresponding map file, and for
that it will again call 'openp', recursively.  That 'openp' call will
again want to ENCODE_FILE with the same encoding, which will again
cause Emacs to try to load the corresponding map file, etc. etc.,
until we exhaust the specpdl stack.

I worked around this by keeping charset-map-path in unibyte form until
later into the startup procedure.  Is there a more elegant and less
kludgey way?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-31 17:16                                             ` Eli Zaretskii
@ 2013-10-31 18:09                                               ` Stefan Monnier
  2013-10-31 18:37                                                 ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2013-10-31 18:09 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 15260

>>>>> "Eli" == Eli Zaretskii <eliz@gnu.org> writes:

>> From: Stefan Monnier <monnier@iro.umontreal.ca>
>> Cc: Kenichi Handa <handa@gnu.org>,  15260@debbugs.gnu.org
>> Date: Wed, 30 Oct 2013 21:01:21 -0400
>> 
>> > Why do we have this confusing inconsistency?
>> 
>> make_string is a bug.  There's no way to know/guess if the string should
>> be unibyte or multibyte.  So, it should be removed and replaced by calls
>> to either make_unibyte_string or make_multibyte_string.

> Here's one more gotcha I bumped into while working on this bug.

> Suppose the filesystem where you build Emacs uses a file-name encoding
> whose coding-system-category is 'charset'.  Example: cpNNNN.  Then,
> when Emacs comes up after dumping, it loads subdirs.el in each
> directory on load-path.  To do this, it calls 'openp' to look for
> DIR/subdirs.el, which involves calling ENCODE_FILE on
> "DIR/subdirs.el", in order to pass that to 'faccessat' or 'open'.
> Now, if the charset that is needed to encode this file name is not yet
> loaded into Emacs, Emacs will try to load it.  To this end, it will
> look along charset-map-path for the corresponding map file, and for
> that it will again call 'openp', recursively.  That 'openp' call will
> again want to ENCODE_FILE with the same encoding, which will again
> cause Emacs to try to load the corresponding map file, etc. etc.,
> until we exhaust the specpdl stack.

So you mean that we have:
- charset-map-path is a multibyte string.
- the file-name encoding uses a charset that's not yet loaded.
How do we get into such a state?


        Stefan





^ permalink raw reply	[flat|nested] 50+ messages in thread

* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-31 18:09                                               ` Stefan Monnier
@ 2013-10-31 18:37                                                 ` Eli Zaretskii
  2013-10-31 19:41                                                   ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-31 18:37 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 15260

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: handa@gnu.org,  15260@debbugs.gnu.org
> Date: Thu, 31 Oct 2013 14:09:39 -0400
> 
> So you mean that we have:
> - charset-map-path is a multibyte string.
> - the file-name encoding uses a charset that's not yet loaded.

Yes.

> How do we get into such a state?

Not sure about the details, since I don't really understand when Emacs
needs to load the charset map.  Perhaps the map is needed only when we
need to encode a string, not for decoding?

Phenomenologically, this happened when charset-map-path was already
decoded (as opposed to being a unibyte string) when this part of
startup.el runs:

  ;; Convert preloaded file names in load-history to absolute.
  (let ((simple-file-name
	 ;; Look for simple.el or simple.elc and use their directory
	 ;; as the place where all Lisp files live.
	 (locate-file "simple" load-path (get-load-suffixes)))
	lisp-dir)

locate-file eventually calls 'openp', which wants to encode
directories from load-path concatenated with simple.el etc.





^ permalink raw reply	[flat|nested] 50+ messages in thread

* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-31 18:37                                                 ` Eli Zaretskii
@ 2013-10-31 19:41                                                   ` Eli Zaretskii
  0 siblings, 0 replies; 50+ messages in thread
From: Eli Zaretskii @ 2013-10-31 19:41 UTC (permalink / raw)
  To: monnier; +Cc: 15260

> Date: Thu, 31 Oct 2013 20:37:52 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 15260@debbugs.gnu.org
> 
> > From: Stefan Monnier <monnier@iro.umontreal.ca>
> > Cc: handa@gnu.org,  15260@debbugs.gnu.org
> > Date: Thu, 31 Oct 2013 14:09:39 -0400
> > 
> > So you mean that we have:
> > - charset-map-path is a multibyte string.
> > - the file-name encoding uses a charset that's not yet loaded.
> 
> Yes.
> 
> > How do we get into such a state?
> 
> Not sure about the details, since I don't really understand when Emacs
> needs to load the charset map.  Perhaps the map is needed only when we
> need to encode a string, not for decoding?

Actually, as can be seen from load_charset_map, we do different things
when the map is needed for decoding and for encoding.  So what
probably happened was that when the file names in load-path etc. were
decoded from cpNNNN, the map file was loaded and load_charset_map did
whatever was necessary to set up the decoder for this encoding.  Then,
when we need to encode a file name using the same cpNNNN, the map file
is loaded again, and load_charset_map now sets up the encoder.

When the decoder was set up, charset-map-path was still in unibyte
form, so the whole thing worked, because ENCODE_FILE doesn't try to
encode unibyte strings.  But once charset-map-path itself was decoded,
the recursive call to 'openp' inside load_charset_map_from_file tried
to encode it, and triggered infinite recursion.
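
The cycle described above can be reproduced with a toy model.  This is
plain C written for illustration only: every identifier, the map file
name, and the depth limit standing in for the specpdl stack are
hypothetical, and the real openp, ENCODE_FILE and load_charset_map do
far more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a Lisp string: bytes plus a flag.  */
struct toy_name
{
  const char *bytes;
  bool multibyte;
};

enum { TOY_MAX_DEPTH = 50 };	/* stand-in for the specpdl limit */

static struct toy_name toy_map_path;	/* models charset-map-path */
static bool toy_encoder_loaded;
static int toy_depth;

static bool toy_openp (struct toy_name name);

/* Model of ENCODE_FILE: unibyte names pass through untouched;
   multibyte names need the encoder, whose map file must first be
   located with toy_openp.  */
static bool
toy_encode_file (struct toy_name name)
{
  assert (name.bytes != NULL);
  if (!name.multibyte || toy_encoder_loaded)
    return true;
  if (!toy_openp (toy_map_path))
    return false;
  toy_encoder_loaded = true;
  return true;
}

/* Model of openp: every lookup encodes the file name first.
   Return false when the nesting limit is exceeded.  */
static bool
toy_openp (struct toy_name name)
{
  bool ok;

  if (++toy_depth > TOY_MAX_DEPTH)
    {
      toy_depth--;
      return false;
    }
  ok = toy_encode_file (name);
  toy_depth--;
  return ok;
}

/* Look for DIR/subdirs.el with the map path in either form.  */
static bool
toy_find_subdirs_el (bool map_path_is_multibyte)
{
  struct toy_name target = { "DIR/subdirs.el", true };

  toy_map_path.bytes = "MAPDIR/CPNNNN.map";
  toy_map_path.multibyte = map_path_is_multibyte;
  toy_encoder_loaded = false;
  toy_depth = 0;
  return toy_openp (target);
}
```

In this model toy_find_subdirs_el (false) succeeds after one nested
lookup, while toy_find_subdirs_el (true) recurses until the depth limit
is hit, which mirrors why keeping charset-map-path unibyte breaks the
loop.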

^ permalink raw reply	[flat|nested] 50+ messages in thread

* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-29  1:35                                   ` Stefan Monnier
  2013-10-29  3:47                                     ` Eli Zaretskii
@ 2013-11-01 13:58                                     ` Kenichi Handa
  1 sibling, 0 replies; 50+ messages in thread
From: Kenichi Handa @ 2013-11-01 13:58 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 15260

In article <jwveh746euj.fsf-monnier+emacsbugs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

> >   emacs = EMACSLOADPATH="$(abs_lisp)" LC_ALL=C "$(EMACS)" $(EMACSOPT)
> >                                       ^^^^^^^^

> > Does anyone know or remember why we set LC_ALL=C while running
> > commands in lisp/ (and the same in leim/)?

> IIRC the issue was to avoid things like misdetecting coding-systems
> because of the user's locale setting, in the files we load/compile.

As far as I remember, yes.

> IOW, it was to work around bugs (e.g. missing coding: cookie) and is
> likely unneeded nowadays.

I agree.

---
Kenichi Handa
handa@gnu.org





^ permalink raw reply	[flat|nested] 50+ messages in thread

* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-28 16:47                               ` Eli Zaretskii
  2013-10-28 18:33                                 ` Eli Zaretskii
@ 2013-10-31 21:45                                 ` Glenn Morris
  2013-11-01  7:45                                   ` Eli Zaretskii
  1 sibling, 1 reply; 50+ messages in thread
From: Glenn Morris @ 2013-10-31 21:45 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 15260

Eli Zaretskii wrote:

> mule-cmds.el calls reset-language-environment, and language/english.el
> calls set-language-info-alist; both have the effect of resetting
> default-file-name-coding-system to latin-1 (!? an interesting
> "default" for a Unicode-era Emacs, perhaps Handa-san could comment why
> we still do that).

I know nothing about this, but eg glib defaults to utf-8, which seems
like a better default to me these days:

https://developer.gnome.org/glib/stable/glib-Character-Set-Conversion.html#file-name-encodings





^ permalink raw reply	[flat|nested] 50+ messages in thread

* bug#15260: cannot build in a directory with non-ascii characters
  2013-10-31 21:45                                 ` Glenn Morris
@ 2013-11-01  7:45                                   ` Eli Zaretskii
  0 siblings, 0 replies; 50+ messages in thread
From: Eli Zaretskii @ 2013-11-01  7:45 UTC (permalink / raw)
  To: Glenn Morris; +Cc: 15260

> From: Glenn Morris <rgm@gnu.org>
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,  handa@gnu.org,  15260@debbugs.gnu.org
> Date: Thu, 31 Oct 2013 17:45:32 -0400
> 
> Eli Zaretskii wrote:
> 
> > mule-cmds.el calls reset-language-environment, and language/english.el
> > calls set-language-info-alist; both have the effect of resetting
> > default-file-name-coding-system to latin-1 (!? an interesting
> > "default" for a Unicode-era Emacs, perhaps Handa-san could comment why
> > we still do that).
> 
> I know nothing about this, but eg glib defaults to utf-8, which seems
> like a better default to me these days:

Yes, probably.  That's why I wrote that comment in parens.

Fortunately, the final patch side-steps this issue altogether by
keeping all the related file names as unibyte strings, so that the
current defaults for encoding file names do not affect anything.  So
we can reason about the default independently of the issues in this
bug.





^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2013-11-04 18:38 UTC | newest]

Thread overview: 50+ messages
2013-09-03 17:46 bug#15260: cannot build in a directory with non-ascii characters Glenn Morris
2013-10-23 20:48 ` Glenn Morris
2013-10-24 18:25   ` Eli Zaretskii
2013-10-24 18:35     ` Glenn Morris
2013-10-25 14:25       ` Eli Zaretskii
2013-10-25 17:08         ` Glenn Morris
2013-10-25 18:31           ` Eli Zaretskii
2013-10-25 18:40             ` Glenn Morris
2013-10-25 18:46               ` Eli Zaretskii
2013-10-25 19:27                 ` Eli Zaretskii
2013-10-26  7:50                   ` Eli Zaretskii
2013-10-26 19:15                     ` Glenn Morris
2013-10-26 20:04                       ` Eli Zaretskii
2013-10-27  3:56                         ` Eli Zaretskii
2013-10-27 16:19                           ` Eli Zaretskii
2013-10-27 19:02                             ` Eli Zaretskii
2013-10-27 19:43                               ` Eli Zaretskii
2013-10-27  4:28                     ` Stefan Monnier
2013-10-27 16:11                       ` Eli Zaretskii
2013-10-28  0:30                         ` Stefan Monnier
2013-10-28  3:39                           ` Eli Zaretskii
2013-10-28  4:05                             ` Stefan Monnier
2013-10-28 16:47                               ` Eli Zaretskii
2013-10-28 18:33                                 ` Eli Zaretskii
2013-10-28 22:00                                   ` Glenn Morris
2013-10-29  3:42                                     ` Eli Zaretskii
2013-10-29  1:35                                   ` Stefan Monnier
2013-10-29  3:47                                     ` Eli Zaretskii
2013-10-29 13:56                                       ` Stefan Monnier
2013-10-30 18:19                                         ` Eli Zaretskii
2013-10-31  1:01                                           ` Stefan Monnier
2013-10-31  3:47                                             ` Eli Zaretskii
2013-10-31 13:40                                               ` Stefan Monnier
2013-10-31 16:25                                                 ` Eli Zaretskii
2013-10-31 18:04                                                   ` Stefan Monnier
2013-10-31 17:59                                               ` Eli Zaretskii
2013-10-31 19:24                                                 ` Stefan Monnier
2013-10-31 19:33                                                   ` Eli Zaretskii
2013-11-01  9:27                                                     ` Eli Zaretskii
2013-11-01 12:33                                                       ` Stefan Monnier
2013-11-04 17:37                                                         ` Eli Zaretskii
2013-11-04 17:35                                                 ` Eli Zaretskii
2013-11-04 18:38                                                   ` Stefan Monnier
2013-10-31 17:16                                             ` Eli Zaretskii
2013-10-31 18:09                                               ` Stefan Monnier
2013-10-31 18:37                                                 ` Eli Zaretskii
2013-10-31 19:41                                                   ` Eli Zaretskii
2013-11-01 13:58                                     ` Kenichi Handa
2013-10-31 21:45                                 ` Glenn Morris
2013-11-01  7:45                                   ` Eli Zaretskii
