unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed
* Foreign file names on MS-Windows
@ 2008-03-22 12:50 Eli Zaretskii
  2008-03-22 13:21 ` Lennart Borgman (gmail)
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Eli Zaretskii @ 2008-03-22 12:50 UTC (permalink / raw)
  To: emacs-devel

This is a bit longish, but there's an important question near the end,
related to the release of Emacs 22.2, so please bear with me.

In the context of this message, ``foreign file names'' means file
names that cannot be expressed using the current system codepage.  For
example, Cyrillic file names on a system whose codepage is 1252 (which
supports only Latin-1 characters).

Problem description: The Windows filesystem holds file names in UTF-16
encoding, which allows it to support file names outside of the current
locale.  Emacs currently uses the ANSI variants of filesystem APIs, so
the file names returned by system calls on which `readdir' (emulated
in src/w32.c) relies are converted by these system calls to the
current codepage.  When Windows encounters characters that cannot be
converted to the current codepage, it converts them to question marks
`?' instead.  A `?' is a character that cannot appear in a valid file
name on a Windows filesystem, so Emacs primitives that are built on
top of `readdir', such as `directory-files-and-attributes',
`directory-files', and file-name completion primitives, all fail for
these file names in different ways: at best these files are silently
omitted from the output, and at worst you see some weirdo error
messages.

A case in point is "C-x d", which on Windows uses `ls' emulation in
ls-lisp, which in turn calls `directory-files-and-attributes': a
simple "C-x d" silently omits foreign file names from the directory
listing, while "C-u C-x d -altr RET" complains about something being
nil instead of a number, and fails to sort the file names as
requested.  This is because `file-attributes' fails for a file name
that includes `?' characters, and `directory-files-and-attributes'
then returns such files without attributes.

Eventually, Emacs 23 should switch to using Unicode APIs to the
filesystem, which will resolve this problem (but we will need to
figure out how not to break W9x versions of Windows, where Unicode
support is an add-on that is typically not installed).

A temporary bandaid, and the only solution that is practical for Emacs
22, is to modify `readdir' to return the 8+3 aliase of the problematic
file name instead of the long name.  The 8+3 aliases use only 7-bit
ASCII characters; they are ugly and butchered to the point of being
unrecognizable, but are otherwise fully functional.  The change below,
which I installed on the trunk, shows how to do that.  After this
change, at least "C-x d", both with and without "C-u", works for me on
directories with such file names.

Now the important question I promised at the beginning: Should we
install this change on the release branch?  Here are the pros and cons
that I could think of for this decision:

Cons:

  . It is too close to release for such non-trivial changes.

  . The affected primitives are used in lots of places, and this
    change could break them, and the Lisp code that uses them.

  . This problem exists in Emacs for a long time, so it's not a big
    deal if it continues to exist some more (until resolved in Emacs
    23).

  . The suggested solution is only partial, and the resulting file
    names are UGLY.

Pros:

  . The bug is quite grave: it causes real data loss.

  . Whatever code uses the affected primitives is probably already
    broken.

  . The change is very simple, so the probability of it being buggy
    is very low (but please eyeball the diffs below to make it lower
    still).

Yidong and Stefan, please decide whether the change below should be
installed on the release branch.

2008-03-22  Eli Zaretskii  <eliz@gnu.org>

	* w32.c (readdir): If FindFirstFile/FindNextFile return in
	cFileName a file name that includes `?' characters, use the 8+3
	alias in cAlternateFileName instead.

Index: src/w32.c
===================================================================
RCS file: /cvsroot/emacs/emacs/src/w32.c,v
retrieving revision 1.130
diff -u -p -r1.130 w32.c
--- src/w32.c	24 Feb 2008 10:09:03 -0000	1.130
+++ src/w32.c	22 Mar 2008 11:51:07 -0000
@@ -1889,6 +1889,8 @@ closedir (DIR *dirp)
 struct direct *
 readdir (DIR *dirp)
 {
+  int downcase = !NILP (Vw32_downcase_file_names);
+
   if (wnet_enum_handle != INVALID_HANDLE_VALUE)
     {
       if (!read_unc_volume (wnet_enum_handle,
@@ -1923,14 +1925,23 @@ readdir (DIR *dirp)
      value returned by stat().  */
   dir_static.d_ino = 1;
 
+  strcpy (dir_static.d_name, dir_find_data.cFileName);
+
+  /* If the file name in cFileName[] includes `?' characters, it means
+     the original file name used characters that cannot be represented
+     by the current ANSI codepage.  To avoid total lossage, retrieve
+     the short 8+3 alias of the long file name.  */
+  if (_mbspbrk (dir_static.d_name, "?"))
+    {
+      strcpy (dir_static.d_name, dir_find_data.cAlternateFileName);
+      downcase = 1;	/* 8+3 aliases are returned in all caps */
+    }
+  dir_static.d_namlen = strlen (dir_static.d_name);
   dir_static.d_reclen = sizeof (struct direct) - MAXNAMLEN + 3 +
     dir_static.d_namlen - dir_static.d_namlen % 4;
-
-  dir_static.d_namlen = strlen (dir_find_data.cFileName);
-  strcpy (dir_static.d_name, dir_find_data.cFileName);
   if (dir_is_fat)
     _strlwr (dir_static.d_name);
-  else if (!NILP (Vw32_downcase_file_names))
+  else if (downcase)
     {
       register char *p;
       for (p = dir_static.d_name; *p; p++)




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Foreign file names on MS-Windows
  2008-03-22 12:50 Foreign file names on MS-Windows Eli Zaretskii
@ 2008-03-22 13:21 ` Lennart Borgman (gmail)
  2008-03-22 13:32   ` Eli Zaretskii
  2008-03-22 14:34 ` Eli Zaretskii
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Lennart Borgman (gmail) @ 2008-03-22 13:21 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii wrote:
> Now the important question I promised at the beginning: Should we
> install this change on the release branch?  Here are the pros and cons
> that I could think of for this decision:
> 
> Cons:
> 
>   . It is too close to release for such non-trivial changes.
> 
>   . The affected primitives are used in lots of places, and this
>     change could break them, and the Lisp code that uses them.
> 
>   . This problem exists in Emacs for a long time, so it's not a big
>     deal if it continues to exist some more (until resolved in Emacs
>     23).
> 
>   . The suggested solution is only partial, and the resulting file
>     names are UGLY.
> 
> Pros:
> 
>   . The bug is quite grave: it causes real data loss.
> 
>   . Whatever code uses the affected primitives is probably already
>     broken.
> 
>   . The change is very simple, so the probability of it being buggy
>     is very low (but please eyeball the diffs below to make it lower
>     still).

Does the problem affect only W9x systems? Does the patch only affect them?




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Foreign file names on MS-Windows
  2008-03-22 13:21 ` Lennart Borgman (gmail)
@ 2008-03-22 13:32   ` Eli Zaretskii
  0 siblings, 0 replies; 11+ messages in thread
From: Eli Zaretskii @ 2008-03-22 13:32 UTC (permalink / raw)
  To: Lennart Borgman (gmail); +Cc: emacs-devel

> Date: Sat, 22 Mar 2008 14:21:10 +0100
> From: "Lennart Borgman (gmail)" <lennart.borgman@gmail.com>
> CC: emacs-devel@gnu.org
> 
> Eli Zaretskii wrote:
> > Now the important question I promised at the beginning: Should we
> > install this change on the release branch?  Here are the pros and cons
> > that I could think of for this decision:
> > 
> > Cons:
> > 
> >   . It is too close to release for such non-trivial changes.
> > 
> >   . The affected primitives are used in lots of places, and this
> >     change could break them, and the Lisp code that uses them.
> > 
> >   . This problem exists in Emacs for a long time, so it's not a big
> >     deal if it continues to exist some more (until resolved in Emacs
> >     23).
> > 
> >   . The suggested solution is only partial, and the resulting file
> >     names are UGLY.
> > 
> > Pros:
> > 
> >   . The bug is quite grave: it causes real data loss.
> > 
> >   . Whatever code uses the affected primitives is probably already
> >     broken.
> > 
> >   . The change is very simple, so the probability of it being buggy
> >     is very low (but please eyeball the diffs below to make it lower
> >     still).
> 
> Does the problem affect only W9x systems? Does the patch only affect them?

No, on the contrary: this problem cannot at all happen on W9x, unless
they have the Unicode add-on installed.  If they do have it, I think
the change will affect them as well.  But currently, it affects mainly
the newer versions: W2K, XP, and Vista that use NTFS as their main
filesystems.




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Foreign file names on MS-Windows
  2008-03-22 12:50 Foreign file names on MS-Windows Eli Zaretskii
  2008-03-22 13:21 ` Lennart Borgman (gmail)
@ 2008-03-22 14:34 ` Eli Zaretskii
  2008-03-22 15:38 ` Jason Rumney
  2008-03-22 17:32 ` Stefan Monnier
  3 siblings, 0 replies; 11+ messages in thread
From: Eli Zaretskii @ 2008-03-22 14:34 UTC (permalink / raw)
  To: emacs-devel

> Date: Sat, 22 Mar 2008 14:50:59 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> 
> Problem description: The Windows filesystem holds file names in UTF-16
> encoding                 ^^^^^^^^^^^^^^^^^^

I should have probably said "the NTFS filesystem".  I don't think any
other native Windows filesystems use UTF-16 (or support file names
outside the current locale, for that matter).




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Foreign file names on MS-Windows
  2008-03-22 12:50 Foreign file names on MS-Windows Eli Zaretskii
  2008-03-22 13:21 ` Lennart Borgman (gmail)
  2008-03-22 14:34 ` Eli Zaretskii
@ 2008-03-22 15:38 ` Jason Rumney
  2008-03-22 17:26   ` Eli Zaretskii
  2008-03-22 17:32 ` Stefan Monnier
  3 siblings, 1 reply; 11+ messages in thread
From: Jason Rumney @ 2008-03-22 15:38 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii wrote:
> A temporary bandaid, and the only solution that is practical for Emacs
> 22, is to modify `readdir' to return the 8+3 aliase of the problematic
> file name instead of the long name.

Seems OK to me. Is readdir really the only place that needs to be fixed 
though?





^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Foreign file names on MS-Windows
  2008-03-22 15:38 ` Jason Rumney
@ 2008-03-22 17:26   ` Eli Zaretskii
  0 siblings, 0 replies; 11+ messages in thread
From: Eli Zaretskii @ 2008-03-22 17:26 UTC (permalink / raw)
  To: Jason Rumney; +Cc: emacs-devel

> Date: Sat, 22 Mar 2008 15:38:05 +0000
> From: Jason Rumney <jasonr@gnu.org>
> CC: emacs-devel@gnu.org
> 
> Eli Zaretskii wrote:
> > A temporary bandaid, and the only solution that is practical for Emacs
> > 22, is to modify `readdir' to return the 8+3 aliase of the problematic
> > file name instead of the long name.
> 
> Seems OK to me.

Thanks for the feedback.

> Is readdir really the only place that needs to be fixed though?

I think so, yes.  All the primitives that I could have thought about
call `readdir' to get the file names.  file-attributes obviously
doesn't, but if it is called with a file name that came from
`readdir', then file-attributes will work correctly as well.  If
file-attributes is called with a file name that cannot be encoded in
the current codepage, it will fail, but that problem has no solution
without full-blown UTF-16 file name support, I think.

But if you find other primitives that fail for such file names, please
name them.  Right now, the only ones that call FindFirstFile, apart of
`readdir', are `stat' and `w32_get_long_filename'.  I don't think the
latter can benefit of the kind of change I made in `readdir'.




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Foreign file names on MS-Windows
  2008-03-22 12:50 Foreign file names on MS-Windows Eli Zaretskii
                   ` (2 preceding siblings ...)
  2008-03-22 15:38 ` Jason Rumney
@ 2008-03-22 17:32 ` Stefan Monnier
  2008-03-22 18:15   ` Eli Zaretskii
  3 siblings, 1 reply; 11+ messages in thread
From: Stefan Monnier @ 2008-03-22 17:32 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

> --- src/w32.c	24 Feb 2008 10:09:03 -0000	1.130
> +++ src/w32.c	22 Mar 2008 11:51:07 -0000
> @@ -1889,6 +1889,8 @@ closedir (DIR *dirp)
>  struct direct *
>  readdir (DIR *dirp)
>  {
> +  int downcase = !NILP (Vw32_downcase_file_names);
> +
>    if (wnet_enum_handle != INVALID_HANDLE_VALUE)
>      {
>        if (!read_unc_volume (wnet_enum_handle,
> @@ -1923,14 +1925,23 @@ readdir (DIR *dirp)
>       value returned by stat().  */
>    dir_static.d_ino = 1;
 
> +  strcpy (dir_static.d_name, dir_find_data.cFileName);
> +
> +  /* If the file name in cFileName[] includes `?' characters, it means
> +     the original file name used characters that cannot be represented
> +     by the current ANSI codepage.  To avoid total lossage, retrieve
> +     the short 8+3 alias of the long file name.  */
> +  if (_mbspbrk (dir_static.d_name, "?"))
> +    {
> +      strcpy (dir_static.d_name, dir_find_data.cAlternateFileName);
> +      downcase = 1;	/* 8+3 aliases are returned in all caps */
> +    }
> +  dir_static.d_namlen = strlen (dir_static.d_name);
>    dir_static.d_reclen = sizeof (struct direct) - MAXNAMLEN + 3 +
>      dir_static.d_namlen - dir_static.d_namlen % 4;
> -
> -  dir_static.d_namlen = strlen (dir_find_data.cFileName);
> -  strcpy (dir_static.d_name, dir_find_data.cFileName);
>    if (dir_is_fat)
>      _strlwr (dir_static.d_name);
> -  else if (!NILP (Vw32_downcase_file_names))
> +  else if (downcase)
>      {
>        register char *p;
>        for (p = dir_static.d_name; *p; p++)

Why not just as in the patch below?
I.e. why fiddle with Vw32_downcase_file_names?
And more importantly why modify the code to change dir_static.d_namlen
before the computation of dir_static.d_reclen (which uses
dir_static.d_namlen)?

The patch below would seem pretty safe to me, whereas yours isn't quite
as obvious.  Yet, it looks like there was a good reason to do it the way
you did.


        Stefan


--- w32.c.~1.130.~	2008-02-24 08:11:40.000000000 -0500
+++ w32.c	2008-03-22 13:31:01.000000000 -0400
@@ -1926,8 +1926,16 @@
   dir_static.d_reclen = sizeof (struct direct) - MAXNAMLEN + 3 +
     dir_static.d_namlen - dir_static.d_namlen % 4;
 
-  dir_static.d_namlen = strlen (dir_find_data.cFileName);
+  /* If the file name in cFileName[] includes `?' characters, it means
+     the original file name used characters that cannot be represented
+     by the current ANSI codepage.  To avoid total lossage, retrieve
+     the short 8+3 alias of the long file name.  */
+  if (_mbspbrk (dir_find_data.cFileName, "?"))
+    strcpy (dir_static.d_name, dir_find_data.cAlternateFileName);
+  else
   strcpy (dir_static.d_name, dir_find_data.cFileName);
+  dir_static.d_namlen = strlen (dir_static.d_name);
+
   if (dir_is_fat)
     _strlwr (dir_static.d_name);
   else if (!NILP (Vw32_downcase_file_names))




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Foreign file names on MS-Windows
  2008-03-22 17:32 ` Stefan Monnier
@ 2008-03-22 18:15   ` Eli Zaretskii
  2008-03-23  0:56     ` Stefan Monnier
  0 siblings, 1 reply; 11+ messages in thread
From: Eli Zaretskii @ 2008-03-22 18:15 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org
> Date: Sat, 22 Mar 2008 13:32:22 -0400
> 
> I.e. why fiddle with Vw32_downcase_file_names?

Because 8+3 aliases are reported in all caps, as the comment says, so
instead of "some foreign file name.mp3" you get something like
"ABCDE~45.MP3", and I fear that not every alist in Emacs that supports
file recognition by extension includes both lower- and upper-case
variants.  Also, lower-casing the 8+3 aliases makes them slightly less
ugly, IMO.

> And more importantly why modify the code to change dir_static.d_namlen
> before the computation of dir_static.d_reclen (which uses
> dir_static.d_namlen)?

The original code used d_namlen to compute d_reclen before the former
was set, which in effect used the value of d_namlen from the previous
invocation of `readdir' (since dir_static is a static variable).  I
fixed that by reordering the statements where d_namlen and d_reclen
are computed.  You will see that the `readdir' implementations in
sysdep.c all do it in the order I did, and I'm quite sure w32.c is
trying to faithfully emulate sysdep.c implementations.

Anyway, d_reclen is not used anywhere in Emacs (and I have no idea
what would it be useful for, even in principle), so if you prefer that
I don't touch d_reclen's line on the branch, I'm okay with that.




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Foreign file names on MS-Windows
  2008-03-22 18:15   ` Eli Zaretskii
@ 2008-03-23  0:56     ` Stefan Monnier
  2008-03-23  4:22       ` Eli Zaretskii
  0 siblings, 1 reply; 11+ messages in thread
From: Stefan Monnier @ 2008-03-23  0:56 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

>> I.e. why fiddle with Vw32_downcase_file_names?

> Because 8+3 aliases are reported in all caps, as the comment says, so
> instead of "some foreign file name.mp3" you get something like
> "ABCDE~45.MP3", and I fear that not every alist in Emacs that supports
> file recognition by extension includes both lower- and upper-case
> variants.  Also, lower-casing the 8+3 aliases makes them slightly less
> ugly, IMO.

The ugliness factor is irrelevant: those files are ugly anyway, and
having them in all-caps makes it arguably (to me) more clear what's
going on.

But the issue of recognition of all-caps extension is a good argument,
that's the one that should be in the comment.

>> And more importantly why modify the code to change dir_static.d_namlen
>> before the computation of dir_static.d_reclen (which uses
>> dir_static.d_namlen)?

> The original code used d_namlen to compute d_reclen before the former
> was set, which in effect used the value of d_namlen from the previous
> invocation of `readdir' (since dir_static is a static variable).  I
> fixed that by reordering the statements where d_namlen and d_reclen
> are computed.  You will see that the `readdir' implementations in
> sysdep.c all do it in the order I did, and I'm quite sure w32.c is
> trying to faithfully emulate sysdep.c implementations.

I see, so it's a separate bugfix.

> Anyway, d_reclen is not used anywhere in Emacs (and I have no idea
> what would it be useful for, even in principle), so if you prefer that
> I don't touch d_reclen's line on the branch, I'm okay with that.

I guess that would be preferable, then,


        Stefan




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Foreign file names on MS-Windows
  2008-03-23  0:56     ` Stefan Monnier
@ 2008-03-23  4:22       ` Eli Zaretskii
  2008-03-24  1:15         ` Stefan Monnier
  0 siblings, 1 reply; 11+ messages in thread
From: Eli Zaretskii @ 2008-03-23  4:22 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org
> Date: Sat, 22 Mar 2008 20:56:51 -0400
> 
> >> I.e. why fiddle with Vw32_downcase_file_names?
> 
> > Because 8+3 aliases are reported in all caps, as the comment says, so
> > instead of "some foreign file name.mp3" you get something like
> > "ABCDE~45.MP3", and I fear that not every alist in Emacs that supports
> > file recognition by extension includes both lower- and upper-case
> > variants.  Also, lower-casing the 8+3 aliases makes them slightly less
> > ugly, IMO.
> 
> The ugliness factor is irrelevant: those files are ugly anyway, and
> having them in all-caps makes it arguably (to me) more clear what's
> going on.
> 
> But the issue of recognition of all-caps extension is a good argument,
> that's the one that should be in the comment.

Is it worth the trouble to downcase only the extension?




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Foreign file names on MS-Windows
  2008-03-23  4:22       ` Eli Zaretskii
@ 2008-03-24  1:15         ` Stefan Monnier
  0 siblings, 0 replies; 11+ messages in thread
From: Stefan Monnier @ 2008-03-24  1:15 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

>> >> I.e. why fiddle with Vw32_downcase_file_names?
>> 
>> > Because 8+3 aliases are reported in all caps, as the comment says, so
>> > instead of "some foreign file name.mp3" you get something like
>> > "ABCDE~45.MP3", and I fear that not every alist in Emacs that supports
>> > file recognition by extension includes both lower- and upper-case
>> > variants.  Also, lower-casing the 8+3 aliases makes them slightly less
>> > ugly, IMO.
>> 
>> The ugliness factor is irrelevant: those files are ugly anyway, and
>> having them in all-caps makes it arguably (to me) more clear what's
>> going on.
>> 
>> But the issue of recognition of all-caps extension is a good argument,
>> that's the one that should be in the comment.

> Is it worth the trouble to downcase only the extension?

I never suggested such a thing.  I just meant to say that downcasing is
worthwhile because of the extension-matching done in the likes of
auto-mode-alist, but these don't only work on the extension anyway.
I.e. your code is fine, it's just the comment that doesn't give the good
reason for it.


        Stefan




^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2008-03-24  1:15 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-03-22 12:50 Foreign file names on MS-Windows Eli Zaretskii
2008-03-22 13:21 ` Lennart Borgman (gmail)
2008-03-22 13:32   ` Eli Zaretskii
2008-03-22 14:34 ` Eli Zaretskii
2008-03-22 15:38 ` Jason Rumney
2008-03-22 17:26   ` Eli Zaretskii
2008-03-22 17:32 ` Stefan Monnier
2008-03-22 18:15   ` Eli Zaretskii
2008-03-23  0:56     ` Stefan Monnier
2008-03-23  4:22       ` Eli Zaretskii
2008-03-24  1:15         ` Stefan Monnier

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).