unofficial mirror of emacs-devel@gnu.org 
* Multibyte and unibyte file names
@ 2013-01-23 17:45 Eli Zaretskii
  2013-01-23 18:08 ` Paul Eggert
                   ` (3 more replies)
  0 siblings, 4 replies; 48+ messages in thread
From: Eli Zaretskii @ 2013-01-23 17:45 UTC (permalink / raw)
  To: emacs-devel; +Cc: Kazuhiro Ito, Michael Albinus

For some initial context, see

  http://debbugs.gnu.org/cgi/bugreport.cgi?bug=13515#14

and my response there.  However, the issue at hand is IMO much
broader.

Let me start with a question: do file primitives need to support
unibyte file names, as well as multibyte ones?  To avoid ambiguity,
let me say right away that by "unibyte" I mean here file names encoded
in some file-name-coding-system, possibly with non-ASCII characters.
I do NOT mean pure-ASCII file names (which in Emacs are normally
represented as unibyte strings).

Looking at the code, it looks like the answer to the above is YES.
For example, expand-file-name clearly tries to be careful to support
both, as seen, for example, from this snippet:

  multibyte = STRING_MULTIBYTE (name);
  if (multibyte != STRING_MULTIBYTE (default_directory))
    {
      if (multibyte)
	default_directory = string_to_multibyte (default_directory);
      else
	{
	  name = string_to_multibyte (name);
	  multibyte = 1;
	}
    }

Moreover, some other primitives clearly expect other primitives to
work on encoded file names.  Here's a fragment from
file_name_completion:

  encoded_dir = ENCODE_FILE (dirname);

  block_input ();
  d = opendir (SSDATA (Fdirectory_file_name (encoded_dir)));

Assuming that encoded file names _should_ be supported, I think this
snippet, from directory_file_name, is a bug:

  if (srclen > 1
      && IS_DIRECTORY_SEP (dst[srclen - 1]))
    {
      dst[srclen - 1] = 0;
      srclen--;
    }

If dst[] is an encoded string that uses a multibyte encoding, it is
wrong to look at just the last byte of the string, because it could be
a trailing byte of some multibyte sequence, right?  There are a lot of
similar fragments in fileio.c, so much so that it seems as if there's
a hidden assumption that these strings cannot be encoded.  Which seems
to contradict the two fragments above, from expand-file-name and from
file_name_completion.  Am I missing something?
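
The concern can be made concrete with a small sketch.  It assumes the
file name is encoded in cp932 (Microsoft's Shift_JIS variant), where
some two-byte characters have 0x5C, the code of a backslash, as their
second byte.  The helper names below are hypothetical stand-ins, not
the functions in fileio.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for IS_DIRECTORY_SEP on MS-Windows.  */
static bool
is_dir_sep (unsigned char c)
{
  return c == '/' || c == '\\';
}

/* Lead bytes of two-byte characters in cp932.  */
static bool
cp932_lead_byte_p (unsigned char c)
{
  return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Byte-wise test, in the style of the directory_file_name fragment.  */
static bool
ends_in_sep_bytewise (const unsigned char *s, size_t len)
{
  assert (s != NULL);
  return len > 0 && is_dir_sep (s[len - 1]);
}

/* Character-wise test: scan from the start so that the second byte
   of a two-byte character is never examined on its own.  */
static bool
ends_in_sep_charwise (const unsigned char *s, size_t len)
{
  assert (s != NULL);
  bool last_is_sep = false;
  size_t i = 0;
  while (i < len)
    {
      if (cp932_lead_byte_p (s[i]) && i + 1 < len)
        {
          last_is_sep = false;	/* A two-byte character is no separator.  */
          i += 2;
        }
      else
        {
          last_is_sep = is_dir_sep (s[i]);
          i++;
        }
    }
  return last_is_sep;
}
```

For the three bytes 'a', 0x95, 0x5C (the letter "a" followed by the
character U+8868), the byte-wise test reports a trailing separator and
the character-wise test does not.  The forward scan is the price of a
DBCS encoding: a trail byte cannot be recognized by looking at it alone.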

Why is this important?  For 2 main reasons:

 1) Many file primitives call dostounix_filename on MS-Windows.  That
    function converts backslashes to forward slashes and optionally
    down-cases the file name.  It is currently written to accept an
    encoded file name, and as long as file primitives need to support
    unibyte file names, dostounix_filename must DTRT with them.
    Encoding file names means in some situations that file names
    un-encodable in file-name-coding-system come out butchered from
    dostounix_filename, whereas some primitives are supposed to work
    on the file names on the syntactic level only, which is
    independent of whether or not a file can be passed to the
    underlying filesystem.  This also means that only cpNNNN encodings
    are fully supported on MS-Windows, because for other encodings
    Windows APIs don't have information which allows, e.g., advancing
    by characters in an encoded file name, looking for slashes and
    backslashes, and down-casing characters.

 2) This gets worse with remote file names.  For these, the handlers
    are always called first, and the result is never run through
    dostounix_filename.  However, Tramp sometimes turns around and
    calls the "real" handler on parts of the remote file name,
    evidently expecting that "real" handler not to do any harm.  But
    due to the above, it does do harm.  While it might be justified to
    limit native file name support to file names encodable with the
    current file-name-coding-system, it _cannot_ be justified for
    remote file names.  An example of this is file-name-directory:

     (defun tramp-handle-file-name-directory (file)
       "Like `file-name-directory' but aware of Tramp files."
       ;; Everything except the last filename thing is the directory.  We
       ;; cannot apply `with-parsed-tramp-file-name', because this expands
       ;; the remote file name parts.  This is a problem when we are in
       ;; file name completion.
       (let ((v (tramp-dissect-file-name file t)))
	 ;; Run the command on the localname portion only.
	 (tramp-make-tramp-file-name
	  (tramp-file-name-method v)
	  (tramp-file-name-user v)
	  (tramp-file-name-host v)
	  (tramp-run-real-handler
	   'file-name-directory (list (or (tramp-file-name-localname v) ""))))))

    which on Windows means that, e.g.

      (let ((file-name-coding-system 'cp1252))
	(file-name-directory "/eliz@fencepost.gnu.org:漢字/"))

       => "/eliz@fencepost.gnu.org:  /"

   And there are other similar handlers in Tramp (e.g., the
   file-name-nondirectory handler) which do the same.  IOW, they seem
   to _assume_ that the corresponding "real" handler never needs to
   encode the file name.  A false assumption.
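
The loss in the example above can be sketched in a few lines of C.
This is only an illustration under stated assumptions: ISO-8859-1 is
used as a simpler stand-in for cp1252, the function name is
hypothetical, and the replacement byte is a parameter because the
sketch does not claim to know what the real conversion substitutes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode N code points from CPS into OUT as ISO-8859-1, writing
   REPLACEMENT for every code point above U+00FF.  Return the number
   of characters that had to be replaced.  OUT must hold N bytes.  */
static size_t
encode_latin1_lossy (const uint32_t *cps, size_t n,
                     unsigned char *out, unsigned char replacement)
{
  assert (cps != NULL && out != NULL);
  size_t lost = 0;
  for (size_t i = 0; i < n; i++)
    {
      if (cps[i] <= 0xFF)
        out[i] = (unsigned char) cps[i];
      else
        {
          out[i] = replacement;	/* Not representable: information is lost.  */
          lost++;
        }
    }
  return lost;
}
```

Encoding U+6F22, U+5B57 and '/' this way replaces the first two
characters and keeps only the slash, so decoding the result cannot
restore the original name.  A purely syntactic operation such as
file-name-directory has no need to go through such a conversion.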

I don't know what to do with this mess.  If file primitives are not
supposed to handle encoded file names, dostounix_filename could be
rewritten to work on multibyte strings in Emacs's internal
representation, and then it wouldn't need to rely on Windows APIs that
require the encoding to be known to Windows and the characters in the
file name be encodable in that encoding.  But that would need
non-trivial changes elsewhere, and we need to decide what to do if an
encoded string does get passed to these primitives (signal an error?).

Note that, as long as encoded multibyte strings can get into these
primitives, code that advances by bytes and examines individual bytes
for equality to certain values like '/' is buggy on Unix as well,
unless I'm missing something.

Comments are welcome, as well as pointers to what I missed.

TIA





* Re: Multibyte and unibyte file names
  2013-01-23 17:45 Multibyte and unibyte file names Eli Zaretskii
@ 2013-01-23 18:08 ` Paul Eggert
  2013-01-23 19:04   ` Eli Zaretskii
  2013-01-23 19:42 ` Michael Albinus
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 48+ messages in thread
From: Paul Eggert @ 2013-01-23 18:08 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Kazuhiro Ito, Michael Albinus, emacs-devel

On 01/23/13 09:45, Eli Zaretskii wrote:

>   if (srclen > 1
>       && IS_DIRECTORY_SEP (dst[srclen - 1]))
>     {
>       dst[srclen - 1] = 0;
>       srclen--;
>     }
> 
> If dst[] is an encoded string that uses a multibyte encoding, it is
> wrong to look at just the last byte of the string, because it could be
> a trailing byte of some multibyte sequence, right?

If memory serves, the answer to that question is different for
GNU / POSIX / etc (GNUish) systems than for MS-Windows systems.
On GNUish systems, the kernel doesn't know about encodings,
so the above code is correct for the file system even if
it produces a byte string that is not properly encoded for
the file name coding system.  On MS-Windows systems, as I
understand it, the operating system is cognizant of which
file name encoding you're using, so the above is indeed an error.

In practice nobody in the GNUish world uses encodings that
are unsafe for '/', so to some extent this is just a theoretical
issue in the GNUish world -- it just doesn't come up.
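
For UTF-8 this safety property can be checked mechanically.  The
sketch below is an illustration, not code from Emacs: it encodes every
code point up to U+10FFFF and looks for the byte 0x2F.  The function
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Encode code point CP as UTF-8 into OUT and return the byte count.  */
static size_t
utf8_encode (uint32_t cp, unsigned char out[4])
{
  assert (cp <= 0x10FFFF);
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  out[0] = (unsigned char) (0xF0 | (cp >> 18));
  out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
  out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
  out[3] = (unsigned char) (0x80 | (cp & 0x3F));
  return 4;
}

/* Return true if the UTF-8 encoding of CP contains the byte '/'.  */
static bool
encoding_contains_slash (uint32_t cp)
{
  unsigned char buf[4];
  size_t n = utf8_encode (cp, buf);
  for (size_t i = 0; i < n; i++)
    if (buf[i] == '/')
      return true;
  return false;
}
```

Every byte of a multibyte UTF-8 sequence has its high bit set, so
encoding_contains_slash is true only for '/' itself, and a byte-wise
scan for '/' cannot land inside a character.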

Unfortunately I don't understand the ins and outs of the
MSish side, or of the Tramp side, so I can't speak to how
that should work.




* Re: Multibyte and unibyte file names
  2013-01-23 18:08 ` Paul Eggert
@ 2013-01-23 19:04   ` Eli Zaretskii
  2013-01-23 23:38     ` Paul Eggert
  0 siblings, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2013-01-23 19:04 UTC (permalink / raw)
  To: Paul Eggert; +Cc: kzhr, michael.albinus, emacs-devel

> Date: Wed, 23 Jan 2013 10:08:25 -0800
> From: Paul Eggert <eggert@cs.ucla.edu>
> CC: emacs-devel@gnu.org, Kazuhiro Ito <kzhr@d1.dion.ne.jp>, 
>  Michael Albinus <michael.albinus@gmx.de>
> 
> On 01/23/13 09:45, Eli Zaretskii wrote:
> 
> >   if (srclen > 1
> >       && IS_DIRECTORY_SEP (dst[srclen - 1]))
> >     {
> >       dst[srclen - 1] = 0;
> >       srclen--;
> >     }
> > 
> > If dst[] is an encoded string that uses a multibyte encoding, it is
> > wrong to look at just the last byte of the string, because it could be
> > a trailing byte of some multibyte sequence, right?
> 
> If memory serves, the answer to that question is different for
> GNU / POSIX / etc (GNUish) systems than for MS-Windows systems.
> On GNUish systems, the kernel doesn't know about encodings,
> so the above code is correct for the file system even if
> it produces a byte string that is not properly encoded for
> the file name coding system.

I understand that, but what it means is that encoding a file name,
then removing its last "slash" as above, then decoding it again will
yield a wrong or even an invalid string, right?  IOW, Emacs will still
have a bug, even though from the OS point of view that slash would
have been regarded as a directory separator.

> On MS-Windows systems, as I understand it, the operating system is
> cognizant of which file name encoding you're using, so the above is
> indeed an error.

The OS uses UTF-16 for file names, but APIs Emacs uses accept
single-byte or DBCS encoded file names, which are converted to UTF-16
internally, before handing them to the filesystem layer.  It is this
conversion that must support the original encoding, or else the UTF-16
result will be incorrect, or in extreme cases the API itself will fail
and reject the file name.

> In practice nobody in the GNUish world uses encodings that
> are unsafe for '/', so to some extent this is just a theoretical
> issue in the GNUish world -- it just doesn't come up.

Yes, that part is quite clear.  Likewise, since UTF-8 is almost always
the file-name encoding, bugs whereby un-encoded file names are passed
to system APIs can easily go unnoticed.




* Re: Multibyte and unibyte file names
  2013-01-23 17:45 Multibyte and unibyte file names Eli Zaretskii
  2013-01-23 18:08 ` Paul Eggert
@ 2013-01-23 19:42 ` Michael Albinus
  2013-01-23 20:05   ` Eli Zaretskii
  2013-01-23 21:09 ` Stefan Monnier
  2013-01-24 10:00 ` Michael Albinus
  3 siblings, 1 reply; 48+ messages in thread
From: Michael Albinus @ 2013-01-23 19:42 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Kazuhiro Ito, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>  2) This gets worse with remote file names.  For these, the handlers
>     are always called first, and the result is never run through
>     dostounix_filename.  However, Tramp sometimes turns around and
>     calls the "real" handler on parts of the remote file name,
>     evidently expecting that "real" handler not to do any harm.  But
>     due to the above, it does do harm.  While it might be justified to
>     limit native file name support to file names encodable with the
>     current file-name-coding-system, it _cannot_ be justified for
>     remote file names.  An example of this is file-name-directory:
>
>      (defun tramp-handle-file-name-directory (file)
>        "Like `file-name-directory' but aware of Tramp files."
>        ;; Everything except the last filename thing is the directory.  We
>        ;; cannot apply `with-parsed-tramp-file-name', because this expands
>        ;; the remote file name parts.  This is a problem when we are in
>        ;; file name completion.
>        (let ((v (tramp-dissect-file-name file t)))
> 	 ;; Run the command on the localname portion only.
> 	 (tramp-make-tramp-file-name
> 	  (tramp-file-name-method v)
> 	  (tramp-file-name-user v)
> 	  (tramp-file-name-host v)
> 	  (tramp-run-real-handler
> 	   'file-name-directory (list (or (tramp-file-name-localname v) ""))))))
>
>     which on Windows means that, e.g.
>
>       (let ((file-name-coding-system 'cp1252))
> 	(file-name-directory "/eliz@fencepost.gnu.org:漢字/"))
>
>        => "/eliz@fencepost.gnu.org:  /"
>
>    And there are other similar handlers in Tramp (e.g., the
>    file-name-nondirectory handler) which do the same.  IOW, they seem
>    to _assume_ that the corresponding "real" handler never needs to
>    encode the file name.  A false assumption.

Tramp is not prepared to handle encoded file names. One of the first
actions on the remote side is to set the environment "LC_ALL=C".
Android devices are an exception; they require UTF-8.

I agree, Tramp shall check carefully what a file name encoding is. This
must be added to the code.

There might be a chance to switch to en_US.UTF-8 on the remote side. But
even here I would propose to start with the unibyte subset. "en_US",
because Tramp parses the output of commands, which must not be
localized.

Other encodings but UTF-8 will be hard to support. It is not only that
Tramp calls "native" file name primitives, there are also several
parsing routines for commands on the remote side, which have their
expectations on file name syntax and their encodings.

> TIA

Best regards, Michael.




* Re: Multibyte and unibyte file names
  2013-01-23 19:42 ` Michael Albinus
@ 2013-01-23 20:05   ` Eli Zaretskii
  2013-01-23 20:58     ` Michael Albinus
  0 siblings, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2013-01-23 20:05 UTC (permalink / raw)
  To: Michael Albinus; +Cc: kzhr, emacs-devel

> From: Michael Albinus <michael.albinus@gmx.de>
> Cc: emacs-devel@gnu.org,  Kazuhiro Ito <kzhr@d1.dion.ne.jp>
> Date: Wed, 23 Jan 2013 20:42:55 +0100
> 
> >      (defun tramp-handle-file-name-directory (file)
> >        "Like `file-name-directory' but aware of Tramp files."
> >        ;; Everything except the last filename thing is the directory.  We
> >        ;; cannot apply `with-parsed-tramp-file-name', because this expands
> >        ;; the remote file name parts.  This is a problem when we are in
> >        ;; file name completion.
> >        (let ((v (tramp-dissect-file-name file t)))
> > 	 ;; Run the command on the localname portion only.
> > 	 (tramp-make-tramp-file-name
> > 	  (tramp-file-name-method v)
> > 	  (tramp-file-name-user v)
> > 	  (tramp-file-name-host v)
> > 	  (tramp-run-real-handler
> > 	   'file-name-directory (list (or (tramp-file-name-localname v) ""))))))
> >
> >     which on Windows means that, e.g.
> >
> >       (let ((file-name-coding-system 'cp1252))
> > 	(file-name-directory "/eliz@fencepost.gnu.org:漢字/"))
> >
> >        => "/eliz@fencepost.gnu.org:  /"
> >
> >    And there are other similar handlers in Tramp (e.g., the
> >    file-name-nondirectory handler) which do the same.  IOW, they seem
> >    to _assume_ that the corresponding "real" handler never needs to
> >    encode the file name.  A false assumption.
> 
> Tramp is not prepared to handle encoded file names.

I didn't try to imply it should.  Tramp should not, however, delegate
its handlers' job to "native" implementations, because those cannot,
in general, be assumed to DTRT for the remote host.

For example, in the particular case of file-name-directory, I think
Tramp should simply do its job by a straightforward removal of the
portion after the last slash in Lisp, instead of calling the native
implementation.
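
The operation meant here is small.  A real Tramp handler would be
written in Lisp; the C sketch below only shows the logic, assuming the
local name is a NUL-terminated string in UTF-8 or a similarly
slash-safe representation.  The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the length of the directory part of NAME, that is,
   everything up to and including the last '/'.  Return 0 if NAME
   contains no '/'.  No encoding or decoding takes place.  */
static size_t
directory_part_length (const char *name)
{
  assert (name != NULL);
  const char *last = strrchr (name, '/');
  return last ? (size_t) (last - name) + 1 : 0;
}
```

Because nothing is converted to file-name-coding-system, a local name
with characters that the local system cannot encode passes through
unchanged.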

> I agree, Tramp shall check carefully what a file name encoding is. This
> must be added to the code.

Sorry, I don't follow.  File names in Lisp are not encoded in any
way.  You only need to encode them when you pass them to commands
executed on the remote host, and decode the results that are output by
those remote commands.

> There might be a chance to switch to en_US.UTF-8 on the remote side. But
> even here I would propose to start with the unibyte subset. "en_US",
> because Tramp parses the output of commands, which must not be
> localized.

Why "must not be localized"?

> Other encodings but UTF-8 will be hard to support. It is not only that
> Tramp calls "native" file name primitives, there are also several
> parsing routines for commands on the remote side, which have their
> expectations on file name syntax and their encodings.

I'm afraid I don't follow here, either.  Emacs is well equipped to
do code conversions from and to almost any encoding out there.  The
only problem is to know which encoding to use when communicating with
the commands on the remote host.  What am I missing?





* Re: Multibyte and unibyte file names
  2013-01-23 20:05   ` Eli Zaretskii
@ 2013-01-23 20:58     ` Michael Albinus
  2013-01-24 16:37       ` Eli Zaretskii
  0 siblings, 1 reply; 48+ messages in thread
From: Michael Albinus @ 2013-01-23 20:58 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: kzhr, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> For example, in the particular case of file-name-directory, I think
> Tramp should simply do its job by a straightforward removal of the
> portion after the last slash in Lisp, instead of calling the native
> implementation.

This would duplicate code. I try to avoid that when possible.

>> I agree, Tramp shall check carefully what a file name encoding is. This
>> must be added to the code.
>
> Sorry, I don't follow.  File names in Lisp are not encoded in any
> way.  You only need to encode them when you pass them to commands
> executed on the remote host, and decode the results that are output by
> those remote commands.

Maybe there's a misunderstanding here. But you gave an example with a
file name with Japanese codings.

>> There might be a chance to switch to en_US.UTF-8 on the remote side. But
>> even here I would propose to start with the unibyte subset. "en_US",
>> because Tramp parses the output of commands, which must not be
>> localized.
>
> Why "must not be localized"?

Tramp does not understand German messages, for example. "de_DE.UTF-8"
would be a no-go. That's why Tramp sets the remote locale to English
messages. Currently it is "C", it could be "en_US.UTF-8" in the
future. But I don't know whether all remote hosts are already prepared
for UTF-8.

>> Other encodings but UTF-8 will be hard to support. It is not only that
>> Tramp calls "native" file name primitives, there are also several
>> parsing routines for commands on the remote side, which have their
>> expectations on file name syntax and their encodings.
>
> I'm afraid I don't follow here, either.  Emacs is well equipped to
> do code conversions from and to almost any encoding out there.  The
> only problem is to know which encoding to use when communicating with
> the commands on the remote host.  What am I missing?

Maybe one could teach Tramp to convert file names in whatever coding to
UTF-8. But shall we do it? And how would that work with other Emacs
flavors? Yes, I must keep XEmacs in mind.

Best regards, Michael.




* Re: Multibyte and unibyte file names
  2013-01-23 17:45 Multibyte and unibyte file names Eli Zaretskii
  2013-01-23 18:08 ` Paul Eggert
  2013-01-23 19:42 ` Michael Albinus
@ 2013-01-23 21:09 ` Stefan Monnier
  2013-01-24 17:02   ` Eli Zaretskii
  2013-01-24 10:00 ` Michael Albinus
  3 siblings, 1 reply; 48+ messages in thread
From: Stefan Monnier @ 2013-01-23 21:09 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Kazuhiro Ito, Michael Albinus, emacs-devel

> Let me start with a question: do file primitives need to support
> unibyte file names, as well as multibyte ones?

[ Oh no, not this mess!  ]

> If dst[] is an encoded string that uses a multibyte encoding, it is
> wrong to look at just the last byte of the string, because it could be
> a trailing byte of some multibyte sequence, right?

In theory, yes.  In practice it doesn't seem to be too much of
a problem, tho it could become more serious if we start using utf-16 for
Windows.

Part of the problem is that not all systems agree on whether a file name
is a sequence of bytes or a sequence of characters.

I think that for w32 it makes sense to try and always decode file names
before returning them to Elisp:
Most file names passed to Elisp primitives are derived from file names
returned by Elisp primitives, so if Emacs decodes all the file names it
returns to Elisp, we can expect to see *very* few encoded file names
passed to Elisp primitives.


        Stefan




* Re: Multibyte and unibyte file names
  2013-01-23 19:04   ` Eli Zaretskii
@ 2013-01-23 23:38     ` Paul Eggert
  0 siblings, 0 replies; 48+ messages in thread
From: Paul Eggert @ 2013-01-23 23:38 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: kzhr, michael.albinus, emacs-devel

On 01/23/13 11:04, Eli Zaretskii wrote:
> encoding a file name,
> then removing its last "slash" as above, then decoding it again will
> yield a wrong or even an invalid string, right?

Yes, the code may be wrong for MS-Windows,
but since the problem can't possibly occur in a POSIXish
system, the code is correct as-is on POSIX.  It's correct to
always treat a '/' byte as a directory separator, no matter what
the file name coding system is.  This is de facto true already,
and as I understand it the next version of POSIX will require this
to be true for all encodings.  Similarly, the next POSIX will
guarantee that '.' always stands for itself, in any encoding.




* Re: Multibyte and unibyte file names
  2013-01-23 17:45 Multibyte and unibyte file names Eli Zaretskii
                   ` (2 preceding siblings ...)
  2013-01-23 21:09 ` Stefan Monnier
@ 2013-01-24 10:00 ` Michael Albinus
  2013-01-24 16:40   ` Eli Zaretskii
  3 siblings, 1 reply; 48+ messages in thread
From: Michael Albinus @ 2013-01-24 10:00 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Kazuhiro Ito, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>     which on Windows means that, e.g.
>
>       (let ((file-name-coding-system 'cp1252))
> 	(file-name-directory "/eliz@fencepost.gnu.org:漢字/"))
>
>        => "/eliz@fencepost.gnu.org:  /"
>
>    And there are other similar handlers in Tramp (e.g., the
>    file-name-nondirectory handler) which do the same.  IOW, they seem
>    to _assume_ that the corresponding "real" handler never needs to
>    encode the file name.  A false assumption.

I've just committed a patch to the trunk, which checks in
`tramp-tramp-file-p', whether file names are unibyte strings. If they
aren't, Tramp ceases to work. That's the best I can do for the moment.

Best regards, Michael.




* Re: Multibyte and unibyte file names
  2013-01-23 20:58     ` Michael Albinus
@ 2013-01-24 16:37       ` Eli Zaretskii
  0 siblings, 0 replies; 48+ messages in thread
From: Eli Zaretskii @ 2013-01-24 16:37 UTC (permalink / raw)
  To: Michael Albinus; +Cc: kzhr, emacs-devel

> From: Michael Albinus <michael.albinus@gmx.de>
> Date: Wed, 23 Jan 2013 21:58:59 +0100
> Cc: kzhr@d1.dion.ne.jp, emacs-devel@gnu.org
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > For example, in the particular case of file-name-directory, I think
> > Tramp should simply do its job by a straightforward removal of the
> > portion after the last slash in Lisp, instead of calling the native
> > implementation.
> 
> This would duplicate code. I try to avoid that when possible.

I think we have no choice in this case.  There's no reason to assume
that processing remote file names with code that is based on local
filesystem will DTRT.  If it works, it's by sheer luck.

> >> I agree, Tramp shall check carefully what a file name encoding is. This
> >> must be added to the code.
> >
> > Sorry, I don't follow.  File names in Lisp are not encoded in any
> > way.  You only need to encode them when you pass them to commands
> > executed on the remote host, and decode the results that are output by
> > those remote commands.
> 
> Maybe there's a misunderstanding here. But you gave an example with a
> file name with Japanese codings.

They were not encoded file names.  They were file names with Japanese
characters, but in the "usual" internal representation used by Emacs
for buffers and strings.  No encoding is involved.

> >> There might be a chance to switch to en_US.UTF-8 on the remote side. But
> >> even here I would propose to start with the unibyte subset. "en_US",
> >> because Tramp parses the output of commands, which must not be
> >> localized.
> >
> > Why "must not be localized"?
> 
> Tramp does not understand German messages, for example. "de_DE.UTF-8"
> would be a no-go. That's why Tramp sets the remote locale to English
> messages.

You can force English for messages, but still have file names be in
UTF-8, no?
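
The locale categories are independent, which is what makes this
possible.  On the remote shell the equivalent would be the LC_MESSAGES
and LC_CTYPE environment variables, with the caveat that LC_ALL, when
set, overrides both.  The sketch below shows the same split inside a C
program; it is an illustration, not what Tramp does, and the function
name is hypothetical.

```c
#include <assert.h>
#include <locale.h>
#include <stdbool.h>

/* Keep messages in the "C" locale so that they stay parseable, and
   select CTYPE_LOCALE (for example "en_US.UTF-8") for the character
   encoding.  Return false if either locale is unavailable.  */
static bool
set_c_messages_with_ctype (const char *ctype_locale)
{
  assert (ctype_locale != NULL);
  if (setlocale (LC_MESSAGES, "C") == NULL)
    return false;
  return setlocale (LC_CTYPE, ctype_locale) != NULL;
}
```

Whether a given UTF-8 locale exists on a host has to be checked at run
time, which is why the function reports failure instead of assuming
the locale is installed.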

> >> Other encodings but UTF-8 will be hard to support. It is not only that
> >> Tramp calls "native" file name primitives, there are also several
> >> parsing routines for commands on the remote side, which have their
> >> expectations on file name syntax and their encodings.
> >
> > I'm afraid I don't follow here, either.  Emacs is well equipped to
> > do code conversions from and to almost any encoding out there.  The
> > only problem is to know which encoding to use when communicating with
> > the commands on the remote host.  What am I missing?
> 
> Maybe one could teach Tramp to convert file names in whatever coding to
> UTF-8.

All you need is call decode-coding-string.

> But shall we do it? And how would that work with other Emacs
> flavors? Yes, I must keep XEmacs in mind.

I'd be surprised if XEmacs didn't support decode-coding-string or
UTF-8.




* Re: Multibyte and unibyte file names
  2013-01-24 10:00 ` Michael Albinus
@ 2013-01-24 16:40   ` Eli Zaretskii
  0 siblings, 0 replies; 48+ messages in thread
From: Eli Zaretskii @ 2013-01-24 16:40 UTC (permalink / raw)
  To: Michael Albinus; +Cc: kzhr, emacs-devel

> From: Michael Albinus <michael.albinus@gmx.de>
> Cc: emacs-devel@gnu.org,  Kazuhiro Ito <kzhr@d1.dion.ne.jp>
> Date: Thu, 24 Jan 2013 11:00:23 +0100
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >     which on Windows means that, e.g.
> >
> >       (let ((file-name-coding-system 'cp1252))
> > 	(file-name-directory "/eliz@fencepost.gnu.org:漢字/"))
> >
> >        => "/eliz@fencepost.gnu.org:  /"
> >
> >    And there are other similar handlers in Tramp (e.g., the
> >    file-name-nondirectory handler) which do the same.  IOW, they seem
> >    to _assume_ that the corresponding "real" handler never needs to
> >    encode the file name.  A false assumption.
> 
> I've just committed a patch to the trunk, which checks in
> `tramp-tramp-file-p', whether file names are unibyte strings. If they
> aren't, Tramp ceases to work. That's the best I can do for the moment.

I think this is wrong.  It means only pure-ASCII file names will be
supported by Tramp, which is too drastic and unjustified.

If you did that to resolve the problem with the above snippet, then
IMO the right solution for that is to write file-name-directory
handler in Lisp, instead of calling the built-in primitive, which has
the adverse side effect on Windows.





* Re: Multibyte and unibyte file names
  2013-01-23 21:09 ` Stefan Monnier
@ 2013-01-24 17:02   ` Eli Zaretskii
  2013-01-24 18:25     ` Stefan Monnier
  0 siblings, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2013-01-24 17:02 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: kzhr, michael.albinus, emacs-devel

> From: Stefan Monnier <monnier@IRO.UMontreal.CA>
> Cc: emacs-devel@gnu.org, Kazuhiro Ito <kzhr@d1.dion.ne.jp>,
>         Michael Albinus <michael.albinus@gmx.de>
> Date: Wed, 23 Jan 2013 16:09:18 -0500
> 
> I think that for w32 it makes sense to try and always decode file names
> before returning them to Elisp:
> Most file names passed to Elisp primitives are derived from file names
> returned by Elisp primitives, so if Emacs decodes all the file names it
> returns to Elisp, we can expect to see *very* few encoded file names
> passed to Elisp primitives.

So you are saying that each primitive should detect unibyte file names
it gets as arguments and DECODE_FILE them right away?  We would also
need to modify all the C code that calls the primitives directly and
passes them encoded file names.  Did I understand your suggestion
correctly?




* Re: Multibyte and unibyte file names
  2013-01-24 17:02   ` Eli Zaretskii
@ 2013-01-24 18:25     ` Stefan Monnier
  2013-01-24 18:38       ` Eli Zaretskii
  0 siblings, 1 reply; 48+ messages in thread
From: Stefan Monnier @ 2013-01-24 18:25 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: kzhr, michael.albinus, emacs-devel

>> I think that for w32 it makes sense to try and always decode file names
>> before returning them to Elisp:
>> Most file names passed to Elisp primitives are derived from file names
>> returned by Elisp primitives, so if Emacs decodes all the file names it
>> returns to Elisp, we can expect to see *very* few encoded file names
>> passed to Elisp primitives.
> So you are saying that each primitive should detect unibyte file names
> it gets as arguments and DECODE_FILE them right away?

I think not: I was talking about decoding the file names *returned*
by primitives.


        Stefan




* Re: Multibyte and unibyte file names
  2013-01-24 18:25     ` Stefan Monnier
@ 2013-01-24 18:38       ` Eli Zaretskii
  2013-01-25  0:06         ` Stefan Monnier
  0 siblings, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2013-01-24 18:38 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: kzhr, michael.albinus, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org,  kzhr@d1.dion.ne.jp,  michael.albinus@gmx.de
> Date: Thu, 24 Jan 2013 13:25:45 -0500
> 
> >> I think that for w32 it makes sense to try and always decode file names
> >> before returning them to Elisp:
> >> Most file names passed to Elisp primitives are derived from file names
> >> returned by Elisp primitives, so if Emacs decodes all the file names it
> >> returns to Elisp, we can expect to see *very* few encoded file names
> >> passed to Elisp primitives.
> > So you are saying that each primitive should detect unibyte file names
> > it gets as arguments and DECODE_FILE them right away?
> 
> I think not: I was talking about decoding the file names *returned*
> by primitives.

What would be the difference, from the POV of the callers of the
primitives?

If there is no difference, what I suggested is easier to implement,
because it eliminates the need to test whether a given file name is
multibyte (and needs to be encoded) or not, before handing the file
names to system APIs.




* Re: Multibyte and unibyte file names
  2013-01-24 18:38       ` Eli Zaretskii
@ 2013-01-25  0:06         ` Stefan Monnier
  2013-01-25  7:37           ` Eli Zaretskii
  0 siblings, 1 reply; 48+ messages in thread
From: Stefan Monnier @ 2013-01-25  0:06 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: kzhr, michael.albinus, emacs-devel

>> > So you are saying that each primitive should detect unibyte file names
>> > it gets as arguments and DECODE_FILE them right away?
>> I think not: I was talking about decoding the file names *returned*
>> by primitives.
> What would be the difference, from the POV of the callers of the
> primitives?

That the callers get to see meaningful (decoded) names?
That file-name manipulation functions don't have the side effect of
encoding/decoding file names?

> If there is no difference, what I suggested is easier to implement,
> because it eliminates the need to test whether a given file name is
> multibyte (and needs to be encoded) or not, before handing the file
> names to system APIs.

I don't see why it eliminates this need: file-exists-p can still be
called with multibyte and unibyte strings.


        Stefan




* Re: Multibyte and unibyte file names
  2013-01-25  0:06         ` Stefan Monnier
@ 2013-01-25  7:37           ` Eli Zaretskii
  2013-01-25 11:36             ` Stefan Monnier
  0 siblings, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2013-01-25  7:37 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: kzhr, michael.albinus, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org,  kzhr@d1.dion.ne.jp,  michael.albinus@gmx.de
> Date: Thu, 24 Jan 2013 19:06:56 -0500
> 
> >> > So you are saying that each primitive should detect unibyte file names
> >> > it gets as arguments and DECODE_FILE them right away?
> >> I think not: I was talking about decoding the file names *returned*
> >> by primitives.
> > What would be the difference, from the POV of the callers of the
> > primitives?
> 
> That the callers get to see meaningful (decoded) names?
> That file-name manipulation functions don't have the side effect of
> encoding/decoding file names?

If we decode unibyte file names at entry to each primitive, before
doing anything else, and thereafter manipulate decoded multibyte
strings, this will happen anyway.

But since everybody (at least those who spoke) seems to think this is a
w32-only problem, I will solve it for w32 only.




* Re: Multibyte and unibyte file names
  2013-01-25  7:37           ` Eli Zaretskii
@ 2013-01-25 11:36             ` Stefan Monnier
  2013-01-25 20:31               ` Eli Zaretskii
  0 siblings, 1 reply; 48+ messages in thread
From: Stefan Monnier @ 2013-01-25 11:36 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: kzhr, michael.albinus, emacs-devel

>> That the callers get to see meaningful (decoded) names?
>> That file-name manipulation functions don't have the side effect of
>> encoding/decoding file names?
> If we decode unibyte file names at entry to each primitive, before
> doing anything else, and thereafter manipulate decoded multibyte
> strings, this will happen anyway.

I get the impression that we're not talking about the same thing.
If you only decode on entry, then Elisp code will first see encoded file
names returned by directory-files and will then see them converted to
decoded form after passing the result to a file-name
manipulation function.

Which is why I suggest decoding right away in the functions that return
file names (e.g. directory-files).

> But since everybody (at least those who spoke) seems to think this is a
> w32-only problem, I will solve it for w32 only.

I think the specific problems you mentioned are mostly non-issues under
POSIX, but the general problem of deciding which representation to use
is more general.


        Stefan




* Re: Multibyte and unibyte file names
  2013-01-25 11:36             ` Stefan Monnier
@ 2013-01-25 20:31               ` Eli Zaretskii
  2013-01-25 22:28                 ` Stefan Monnier
  2013-01-26  3:04                 ` Stephen J. Turnbull
  0 siblings, 2 replies; 48+ messages in thread
From: Eli Zaretskii @ 2013-01-25 20:31 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: kzhr, michael.albinus, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org,  kzhr@d1.dion.ne.jp,  michael.albinus@gmx.de
> Date: Fri, 25 Jan 2013 06:36:39 -0500
> 
> >> That the callers get to see meaningful (decoded) names?
> >> That file-name manipulation functions don't have the side effect of
> >> encoding/decoding file names?
> > If we decode unibyte file names at entry to each primitive, before
> > doing anything else, and thereafter manipulate decoded multibyte
> > strings, this will happen anyway.
> 
> I get the impression that we're not talking about the same thing.

Looks like that.

> If you only decode on entry, then Elisp code will first see encoded file
> names returned by directory-files and will then see them converted to
> decoded form after passing the result to a file-name
> manipulation function.

No.  Elisp code will see _decoded_ file names from directory-files,
because we already decode them.  I didn't mean to change that.

What I meant was to return decoded file names from all file-name
primitives, such as file-name-nondirectory, even if their input was
encoded.

> Which is why I suggest to decode right away in the functions that return
> file names (e.g. directory-files).

We already do that, so there's no issue in that department.

The issue is in the file-name primitives that want to support both
encoded and decoded file names, and as I understand from this
discussion, this feature should stay.

> > But since everybody (at least those who spoke) seems to think this is a
> > w32-only problem, I will solve it for w32 only.
> 
> I think the specific problems you mentioned are mostly non-issues under
> POSIX, but the general problem of deciding which representation to use
> is more general.

I thought this was already decided in favor of decoded file names,
a.k.a. "multibyte strings".  The few calls that pass encoded file
names are rare exceptions, but since we want to keep support for
encoded file names, fixing those few places is not going to buy us
anything except code reshuffling.

The problem with encoded file names is that we have little support for
them.  E.g., we cannot up-/down-case them (except if we know the
encoding is supported by the current locale).  For multibyte encodings
that are not UTF-8, we also cannot scan them by characters, only by
bytes, so e.g. strchr will not generally work reliably.  We are
crippled.

So some things will never work with encoded file names, but I guess no
one cares, because most of those problems go away if the encoding is
UTF-8.  Fine; if no one cares, neither do I.
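
To make the case-conversion point concrete, here is a minimal sketch in
plain C, not Emacs code.  The helper names (downcase_bytes,
cp932_lead_byte_p, downcase_cp932) are hypothetical, and the lead-byte
ranges are the commonly documented ones for cp932; treat both as
assumptions.  It shows that ASCII downcasing applied byte by byte can
rewrite the trailing byte of a two-byte character, whereas a scan that
knows the encoding leaves it alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: ASCII-only downcasing, one byte at a time,
   with no knowledge of the string's encoding.  */
static void
downcase_bytes (unsigned char *s, size_t len)
{
  assert (s != NULL);
  for (size_t i = 0; i < len; i++)
    if (s[i] >= 'A' && s[i] <= 'Z')
      s[i] = (unsigned char) (s[i] - 'A' + 'a');
}

/* Assumed cp932 lead-byte ranges: 0x81..0x9F and 0xE0..0xFC.  */
static int
cp932_lead_byte_p (unsigned char c)
{
  return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Hypothetical helper: downcase only single-byte ASCII letters,
   stepping over two-byte cp932 characters as a unit.  */
static void
downcase_cp932 (unsigned char *s, size_t len)
{
  assert (s != NULL);
  size_t i = 0;
  while (i < len)
    {
      if (cp932_lead_byte_p (s[i]) && i + 1 < len)
        i += 2;                 /* skip lead byte and trailing byte */
      else
        {
          if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] = (unsigned char) (s[i] - 'A' + 'a');
          i++;
        }
    }
}
```

With the three bytes 'F', 0x83, 0x41 (an ASCII letter followed by a
two-byte cp932 character whose trailing byte equals ASCII 'A'),
downcase_bytes changes the last byte to 0x61 and so produces a
different character, while downcase_cp932 touches only the 'F'.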




* Re: Multibyte and unibyte file names
  2013-01-25 20:31               ` Eli Zaretskii
@ 2013-01-25 22:28                 ` Stefan Monnier
  2013-01-26 10:54                   ` Eli Zaretskii
  2013-01-26  3:04                 ` Stephen J. Turnbull
  1 sibling, 1 reply; 48+ messages in thread
From: Stefan Monnier @ 2013-01-25 22:28 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: kzhr, michael.albinus, emacs-devel

> What I meant was to return decoded file names from all file-name
> primitives, such as file-name-nondirectory, even if their input was
> encoded.

It's probably OK to do that, but I wonder why we'd need to do it: under
what circumstances could such a primitive receive an encoded file-name,
if all the file names returned to Elisp (by things like directory-files)
are already decoded?

>> Which is why I suggest to decode right away in the functions that return
>> file names (e.g. directory-files).
> We already do that, so there's no issue in that department.

Good.

> The issue is in the file-name primitives that want to support both
> encoded and decoded file names, and as I understand from this
> discussion, this feature should stay.

Of course, we shouldn't just reject encoded filenames, but I don't see
why we should worry too much about them.

> So some things will never work with encoded file names, but I guess no
> one cares, because most of those problems go away if the encoding is
> UTF-8.  Fine; if no one cares, neither do I.

Actually, even with other coding systems, this shouldn't be a serious
issue since encoded file names should be rare.


        Stefan




* Re: Multibyte and unibyte file names
  2013-01-25 20:31               ` Eli Zaretskii
  2013-01-25 22:28                 ` Stefan Monnier
@ 2013-01-26  3:04                 ` Stephen J. Turnbull
  2013-01-26 11:27                   ` Eli Zaretskii
  2013-01-26 16:05                   ` Richard Stallman
  1 sibling, 2 replies; 48+ messages in thread
From: Stephen J. Turnbull @ 2013-01-26  3:04 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: kzhr, michael.albinus, Stefan Monnier, emacs-devel

Eli Zaretskii writes:

 > We are crippled.

Appendicitis feels that way while you have it.  Cut out the inflamed
appendix and in a couple days you are as functional as ever.

"Unibyte" as implemented in Emacs is a premature optimization, and a
disaster in search of places to happen.  Remove it, and you'll never
notice it's gone.  The consequence of that removal would be to fix
this problem, permanently.

As Stefan says, there would remain a more general problem -- with
the exception of Windows Unicode APIs -- that there is no absolutely
reliable way of determining the user's intended encoding.  However,
the only important cases where this interferes with usual filename
parsing needs are Shift JIS and Big 5 on Windows, where you *do* have
that absolutely reliable alternative.  (Users who encode file names to
Shift JIS or ISO-2022-JP on POSIX file systems deserve what they get,
and Emacs is by far not the only executioner.  POSIX specifies that
directory entry names are byte sequences, so all apps that use file
names are susceptible to these bugs.)

The right thing to do in some sense is to have an "external file name
type" which stores both the Emacs string name and (if the name was
received as bytes from outside) a representation of those bytes.
Rather than change the Lisp_String structure, I would recommend
putting a property (`text-as-received', `externally-coded-text', or
whatever) on the string.  The content of the property would be the
filename decoded as 'binary (or perhaps using Emacs's
undecodable-bytes representation).

Although Emacs doesn't seem to have string properties (ie, on the
object), one can put a text property on the string (or use an overlay,
which might work for the degenerate case of a 0-length string).  This
would allow callers (and sufficiently Type A users) to retry decoding
with a different encoding.

Of course this requires rather smart callers if they slice-n-dice the
file name.




* Re: Multibyte and unibyte file names
  2013-01-25 22:28                 ` Stefan Monnier
@ 2013-01-26 10:54                   ` Eli Zaretskii
  2013-01-26 11:34                     ` Stefan Monnier
  0 siblings, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2013-01-26 10:54 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: kzhr, michael.albinus, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org,  kzhr@d1.dion.ne.jp,  michael.albinus@gmx.de
> Date: Fri, 25 Jan 2013 17:28:40 -0500
> 
> > What I meant was to return decoded file names from all file-name
> > primitives, such as file-name-nondirectory, even if their input was
> > encoded.
> 
> It's probably OK to do that, but I wonder why we'd need to do it

It's not a goal in itself, it's a side effect: if every primitive
decodes any encoded file name on entry, it will thereafter manipulate
decoded strings throughout its execution, and will therefore return a
decoded string.  (We could, of course, encode it back if we found the
argument encoded, but then it isn't exactly clear what to do when some
arguments are encoded, the others aren't; and if some of them are
pure-ASCII, they are not easily distinguished from encoded file names.)

> under what circumstances could such a primitive receive an encoded
> file-name, if all the file names returned to Elisp (by things like
> directory-files) are already decoded?

One way is that a primitive gets called from C.  I gave one example of
this in my original message.  There aren't many of such examples, but
if we _want_ to support encoded file names, the code needs to DTRT
with them, even if this happens only once in a blue moon.

> > The issue is in the file-name primitives that want to support both
> > encoded and decoded file names, and as I understand from this
> > discussion, this feature should stay.
> 
> Of course, we shouldn't just reject encoded filenames, but I don't see
> why we should worry too much about them.

I "worry" because they need separate code, especially with multibyte
encodings; writing that code for an encoding not supported by the
current locale is tricky at best, if not downright impossible, and
certainly inefficient.  Are you saying that since this happens
infrequently, we could process such file names in a broken way,
e.g. finding a directory separator where there's none, as demonstrated
in http://debbugs.gnu.org/cgi/bugreport.cgi?bug=13515#5?

> > So some things will never work with encoded file names, but I guess no
> > one cares, because most of those problems go away if the encoding is
> > UTF-8.  Fine; if no one cares, neither do I.
> 
> Actually, even with other coding systems, this shouldn't be a serious
> issue since encoded file names should be rare.

The code needs to be there anyway.  We cannot remove it, and we cannot
break it, because people will complain.




* Re: Multibyte and unibyte file names
  2013-01-26  3:04                 ` Stephen J. Turnbull
@ 2013-01-26 11:27                   ` Eli Zaretskii
  2013-01-26 13:03                     ` Stephen J. Turnbull
  2013-01-26 16:05                   ` Richard Stallman
  1 sibling, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2013-01-26 11:27 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: kzhr, michael.albinus, monnier, emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,
>     kzhr@d1.dion.ne.jp,
>     michael.albinus@gmx.de,
>     emacs-devel@gnu.org
> Date: Sat, 26 Jan 2013 12:04:50 +0900
> 
> "Unibyte" as implemented in Emacs is a premature optimization, and a
> disaster in search of places to happen.  Remove it, and you'll never
> notice it's gone.  The consequence of that removal would be to fix
> this problem, permanently.

I don't think you are entirely correct.  We still need to send encoded
(unibyte) strings to the outside world.  IOW, file names are not the
only user of unibyte strings.

> As Stefan says, there would remain a more general problem -- with
> the exception of Windows Unicode APIs -- that there is no absolutely
> reliable way of determining the user's intended encoding.

That's a non-issue: we treat unibyte file names as encoded in
file-name-coding-system.  Nothing else is supported, or needed.

> However, the only important cases where this interferes with usual
> filename parsing needs are Shift JIS and Big 5 on Windows, where you
> *do* have that absolutely reliable alternative.

Again, detecting the encoding is a non-issue.  When I see an encoded
file name, I always _know_ how it was encoded, and I can decode it by
using DECODE_FILE.

> The right thing to do in some sense is to have an "external file name
> type" which stores both the Emacs string name and (if the name was
> received as bytes from outside) a representation of those bytes.
> Rather than change the Lisp_String structure, I would recommend
> putting a property (`text-as-received', `externally-coded-text', or
> whatever) on the string.  The content of the property would be the
> filename decoded as 'binary (or perhaps using Emacs's
> undecodable-bytes representation).
> 
> Although Emacs doesn't seem to have string properties (ie, on the
> object), one can put a text property on the string (or use an overlay,
> which might work for the degenerate case of a 0-length string).  This
> would allow callers (and sufficiently Type A users) to retry decoding
> with a different encoding.
> 
> Of course this requires rather smart callers if they slice-n-dice the
> file name.

Exactly.  Moreover, what you suggest is a large project that won't
happen without a motivated individual.  Given the overall "cannot
happen on POSIX, so it's SEP" reaction I got to this thread, what do
you think are the chances of such a project materializing any time
soon?

And that is even before we start to talk about the details of your
proposal and consider its downsides (what to do when
file-name-coding-system is changed, too many overlays adversely impact
performance, ...).




* Re: Multibyte and unibyte file names
  2013-01-26 10:54                   ` Eli Zaretskii
@ 2013-01-26 11:34                     ` Stefan Monnier
  2013-01-26 13:16                       ` Eli Zaretskii
  2013-01-26 13:20                       ` Stephen J. Turnbull
  0 siblings, 2 replies; 48+ messages in thread
From: Stefan Monnier @ 2013-01-26 11:34 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: kzhr, michael.albinus, emacs-devel

>> under what circumstances could such a primitive receive an encoded
>> file-name, if all the file names returned to Elisp (by things like
>> directory-files) are already decoded?
> One way is that a primitive gets called from C.

So we should fix the (C) caller.

> if we _want_ to support encoded file names, the code needs to DTRT
> with them, even if this happens only once in a blue moon.

I think the right thing to do with unibyte file names is to treat them
as a sequence of bytes, not a sequence of encoded chars.  If the caller
doesn't like it, then she should pass a decoded file name instead.

> I "worry" because they need separate code,

I think if we only support "sequences of bytes" (unibyte strings) and
"sequences of decoded chars" (multibyte strings), there is not much need
for separating the code since there's no risk of a special char (like
"/", "." or ":") appearing there while it meant something else.

> especially with multibyte encodings; writing that code for an encoding
> not supported by the current locale is tricky at best, if not
> downright impossible, and certainly inefficient.

Better not second guess the caller about which encoding she meant.

> Are you saying that since this happens
> infrequently, we could process such file names in a broken way,

Right.

> e.g. finding a directory separator where there's none, as demonstrated
> in http://debbugs.gnu.org/cgi/bugreport.cgi?bug=13515#5?

That seems like a real bug, tho:

   (let ((file-name-coding-system 'cp932))
     (expand-file-name "表" "C:/"))

should not return "c:/\225/".  Why does it even pay attention to
file-name-coding-system?


        Stefan




* Re: Multibyte and unibyte file names
  2013-01-26 11:27                   ` Eli Zaretskii
@ 2013-01-26 13:03                     ` Stephen J. Turnbull
  2013-01-26 13:36                       ` Eli Zaretskii
  0 siblings, 1 reply; 48+ messages in thread
From: Stephen J. Turnbull @ 2013-01-26 13:03 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii writes:

 > > "Unibyte" as implemented in Emacs is a premature optimization, and a
 > > disaster in search of places to happen.  Remove it, and you'll never
 > > notice it's gone.  The consequence of that removal would be to fix
 > > this problem, permanently.
 > 
 > I don't think you are entirely correct.

My preferred flavor of Emacs never had unibyte.  It's got its problems
in this area, but they're just lazy or over-ambitious programmer bugs,
not a design flaw.

 > We still need to send encoded (unibyte) strings to the outside
 > world.

Of course.  In fact, pretty much all interaction with the outside
world involves byte streams.  The problem Emacs is experiencing here
is that Lisp can see bytes when it is designed only to work with
characters.

 > [Determining file name encoding] a non-issue: we treat unibyte file
 > names as encoded in file-name-coding-system.  Nothing else is
 > supported, or needed.

It is in Japan, where it's still common to have a host whose hard
drive uses UTF-8, mounting EUC-JP-encoded volumes over NFS, and USB
drives with Shift-JIS file names.  I've even seen file names
containing segments encoded variously in KOI8, Shift JIS, *and* EUC-JP
(in Macintosh notation, no less).  Admittedly, not in a very long
time, but it's still *possible* to do that on POSIX systems.

You just can't win in this environment; you will see mojibake, and
sometimes undecodable names, unless you get help from the user.  Such
names can be round-tripped using special "undecodable bytes"
representation (UTF-8B or non-unicode code points).  But if you try to
manipulate those names in Lisp, you will sometimes get incorrect
results.

 > Exactly.  Moreover, what you suggest is a large project that won't
 > happen without a motivated individual.  Given the overall "cannot
 > happen on POSIX, so it's SEP"

It can easily happen on POSIX systems, especially with removable media
or double-booting hosts.  The problem is that most people don't care
about Japanese or Chinese, and of those that do, I'm sure most think
that Shift JIS and Big5 are abominations (except for a few Windows
users).

 > reaction I got to this thread, what do you think are the chances of
 > such a project to materialize any time soon?

Not my problem, either.  My preferred flavor of Emacs hasn't had
unibyte-related issues since 1998.

But I don't see why it should be so difficult.  You already have all
the functions needed to decode byte streams to Lisp strings or
buffers, and that's the normal mode of operation, no?  In fact AFAIK
the set of programs that use the unibyte feature at all is pretty
small, and most of those (like Tramp) do so only in self-defense.




* Re: Multibyte and unibyte file names
  2013-01-26 11:34                     ` Stefan Monnier
@ 2013-01-26 13:16                       ` Eli Zaretskii
  2013-01-26 22:11                         ` Stefan Monnier
  2013-01-26 13:20                       ` Stephen J. Turnbull
  1 sibling, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2013-01-26 13:16 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: kzhr, michael.albinus, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org,  kzhr@d1.dion.ne.jp,  michael.albinus@gmx.de
> Date: Sat, 26 Jan 2013 06:34:16 -0500
> 
> >> under what circumstances could such a primitive receive an encoded
> >> file-name, if all the file names returned to Elisp (by things like
> >> directory-files) are already decoded?
> > One way is that a primitive gets called from C.
> 
> So we should fix the (C) caller.

OK, but as long as file-name primitives are required to support
unibyte strings, you cannot be sure these situations won't pop up in
the future.

> I think the right thing to do with unibyte file names is to treat them
> as a sequence of bytes, not a sequence of encoded chars.  If the caller
> doesn't like it, then she should pass a decoded file name instead.

This effectively means we don't support them _as_file_names_.
Because, e.g., testing individual bytes for equality to something like
'\\' can trip on multibyte (DBCS) encodings if the trailing byte
happens to be '\\'.  In general, it isn't "safe" to iterate over these
strings one byte at a time.
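
As an illustration only (plain C, not the Emacs sources): the sketch
below searches for the last directory separator in a cp932-encoded
name, first byte by byte and then with a forward scan that steps over
two-byte characters.  The function names and the lead-byte ranges
(0x81..0x9F, 0xE0..0xFC) are assumptions made for the example.  The
test name is "C:/" followed by the bytes 0x95 0x5C, the cp932 encoding
of "表" from the bug report, whose trailing byte equals ASCII '\\'.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cp932 lead-byte ranges: 0x81..0x9F and 0xE0..0xFC.  */
static int
cp932_lead_byte_p (unsigned char c)
{
  return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Naive search: return the index of the last byte that looks like a
   directory separator, or -1.  Works backwards from the end, so it
   cannot tell a real '\\' from the trailing byte of a DBCS character.  */
static long
last_dir_sep_bytewise (const unsigned char *s, size_t len)
{
  assert (s != NULL);
  for (size_t i = len; i > 0; i--)
    if (s[i - 1] == '/' || s[i - 1] == '\\')
      return (long) (i - 1);
  return -1;
}

/* Encoding-aware search: scan forward from the start, treating a lead
   byte and the byte after it as one character, and only test
   single-byte characters against the separators.  */
static long
last_dir_sep_cp932 (const unsigned char *s, size_t len)
{
  assert (s != NULL);
  long last = -1;
  size_t i = 0;
  while (i < len)
    {
      if (cp932_lead_byte_p (s[i]) && i + 1 < len)
        i += 2;
      else
        {
          if (s[i] == '/' || s[i] == '\\')
            last = (long) i;
          i++;
        }
    }
  return last;
}
```

For the five bytes 'C', ':', '/', 0x95, 0x5C the byte-wise search
reports a separator at index 4, in the middle of the character, while
the forward scan reports index 2.  Note that the aware version has to
start from the beginning of the string; looking only at the last byte
cannot decide whether it is a trailing byte.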

> > I "worry" because they need separate code,
> 
> I think if we only support "sequences of bytes" (unibyte strings) and
> "sequences of decoded chars" (multibyte strings), there is not much need
> for separating the code since there's no risk of a special char (like
> "/", "." or ":") appearing there while it meant something else.

See above: the risk is real, at least on MS-Windows.  That's what
these bugs I've been mentioning are all about.

> > especially with multibyte encodings; writing that code for an encoding
> > not supported by the current locale is tricky at best, if not
> > downright impossible, and certainly inefficient.
> 
> Better not second guess the caller about which encoding she meant.

We invariably assume that the encoding is given by
file-name-coding-system (or by default-file-name-coding-system, if
file-name-coding-system is nil).  I don't see any reason to support
anything else.  Lisp code can always bind file-name-coding-system if
it needs a different encoding.

> > Are you saying that since this happens
> > infrequently, we could process such file names in a broken way,
> 
> Right.

Heh, I don't think this will be well accepted.

> > e.g. finding a directory separator where there's none, as demonstrated
> > in http://debbugs.gnu.org/cgi/bugreport.cgi?bug=13515#5?
> 
> That seems like a real bug, tho:

Of course, it's a real bug!  This is what will happen, at least on
Windows, if we decide not to pay attention to encoded file names.
Which is what we do now, in many places.

>    (let ((file-name-coding-system 'cp932))
>      (expand-file-name "表" "C:/"))
> 
> should not return "c:/\225/".  Why does it even pay attention to
> file-name-coding-system?

Because it encodes the file name it passes to dostounix_filename.  And
it does that because dostounix_filename needs optionally to downcase
the name (when w32-downcase-file-names is set).  The way
dostounix_filename downcases file names depends on the current locale,
so it must get encoded file names.

It is easy enough to fix dostounix_filename, so that it doesn't
require encoded file names.  But while I reviewed the code that
calls dostounix_filename, I found that I couldn't figure out what were
the requirements for such code, and that's why I started this thread:
to understand the requirements.  For example, we find this in
file-name-directory:

    while (p != beg && !IS_DIRECTORY_SEP (p[-1])
  #ifdef DOS_NT
	   /* only recognize drive specifier at the beginning */
	   && !(p[-1] == ':'
		/* handle the "/:d:foo" and "/:foo" cases correctly  */
		&& ((p == beg + 2 && !IS_DIRECTORY_SEP (*beg))
		    || (p == beg + 4 && IS_DIRECTORY_SEP (*beg))))
  #endif
	   ) p--;

If p points to an encoded file name, we could think we found a
backslash in p[-1], where in fact it's a trailing byte of a multibyte
sequence.  And this is just an example; just search for
IS_DIRECTORY_SEP and you will find quite a bit more.

If file-name-directory manipulates only decoded file names in their
internal representation, then such problems will never happen, because
UTF-8 precludes them.  Thus my question whether we want to support
encoded file names in these primitives as first-class citizens.  And I
still cannot figure out the answer ;-)
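
A small sketch of why UTF-8 precludes the problem, in plain C with
hypothetical helper names: every byte of a multi-byte UTF-8 sequence
has its high bit set, so no byte of such a sequence can compare equal
to an ASCII separator.  The byte values used in the usage note are the
UTF-8 and cp932 encodings of "表"; everything else is an assumption
made for illustration, not code from fileio.c.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if every byte of SEQ is >= 0x80, i.e. none of them can be
   mistaken for an ASCII character such as '/' or '\\'.  */
static int
all_bytes_non_ascii (const unsigned char *seq, size_t len)
{
  assert (seq != NULL);
  for (size_t i = 0; i < len; i++)
    if (seq[i] < 0x80)
      return 0;
  return 1;
}

/* Byte-wise search for the last directory separator, or -1.  This is
   only safe when non-ASCII characters never contain ASCII bytes.  */
static long
last_dir_sep_bytewise (const unsigned char *s, size_t len)
{
  assert (s != NULL);
  for (size_t i = len; i > 0; i--)
    if (s[i - 1] == '/' || s[i - 1] == '\\')
      return (long) (i - 1);
  return -1;
}
```

The UTF-8 bytes of "表" are 0xE8 0xA1 0xA8, for which
all_bytes_non_ascii returns 1; the cp932 bytes 0x95 0x5C make it return
0.  Consequently the byte-wise search on "C:/" followed by the UTF-8
bytes finds the separator at index 2, as intended.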






* Re: Multibyte and unibyte file names
  2013-01-26 11:34                     ` Stefan Monnier
  2013-01-26 13:16                       ` Eli Zaretskii
@ 2013-01-26 13:20                       ` Stephen J. Turnbull
  1 sibling, 0 replies; 48+ messages in thread
From: Stephen J. Turnbull @ 2013-01-26 13:20 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

Stefan Monnier writes:

 > I think if we only support "sequences of bytes" (unibyte strings)

What justifies the effort of supporting unibyte?  Nobody writes
byte-shoveling applications like high-performance network servers in
Emacs Lisp and nobody is likely to start doing so.  High-performance
text processing needs to deal with characters, so you're not really
losing anything.

 > That seems like a real bug, tho:
 > 
 >    (let ((file-name-coding-system 'shift-jis))
 >      (expand-file-name "表" "C:/"))
 > 
 > should not return "c:/\225/".

OMG.







* Re: Multibyte and unibyte file names
  2013-01-26 13:03                     ` Stephen J. Turnbull
@ 2013-01-26 13:36                       ` Eli Zaretskii
  2013-01-26 16:26                         ` Paul Eggert
  2013-01-26 17:10                         ` Stephen J. Turnbull
  0 siblings, 2 replies; 48+ messages in thread
From: Eli Zaretskii @ 2013-01-26 13:36 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-devel

> From: "Stephen J. Turnbull" <turnbull@sk.tsukuba.ac.jp>
> Cc: emacs-devel@gnu.org
> Date: Sat, 26 Jan 2013 22:03:28 +0900
> 
> Eli Zaretskii writes:
> 
>  > > "Unibyte" as implemented in Emacs is a premature optimization, and a
>  > > disaster in search of places to happen.  Remove it, and you'll never
>  > > notice it's gone.  The consequence of that removal would be to fix
>  > > this problem, permanently.
>  > 
>  > I don't think you are entirely correct.
> 
> My preferred flavor of Emacs never had unibyte.  It's got its problems
> in this area, but they're just lazy or over-ambitious programmer bugs,
> not a design flaw.

I can't reason about something I know nothing about.  So this is not a
useful argument.

>  > We still need to send encoded (unibyte) strings to the outside
>  > world.
> 
> Of course.  In fact, pretty much all interaction with the outside
> world involves byte streams.  The problem Emacs is experiencing here
> is that Lisp can see bytes when it is designed only to work with
> characters.

In GNU Emacs, Lisp can work with bytes as well.

>  > [Determining file name encoding] a non-issue: we treat unibyte file
>  > names as encoded in file-name-coding-system.  Nothing else is
>  > supported, or needed.
> 
> It is in Japan, where it's still common to have a host whose hard
> drive uses UTF-8, mounting EUC-JP-encoded volumes over NFS, and USB
> drives with Shift-JIS file names.  I've even seen file names
> containing segments encoded variously in KOI8, Shift JIS, *and* EUC-JP
> (in Macintosh notation, no less).  Admittedly, not in a very long
> time, but it's still *possible* to do that on POSIX systems.
> 
> You just can't win in this environment; you will see mojibake, and
> sometimes undecodable names, unless you get help from the user.  Such
> names can be round-tripped using special "undecodable bytes"
> representation (UTF-8B or non-unicode code points).  But if you try to
> manipulate those names in Lisp, you will sometimes get incorrect
> results.

That's OK.  Emacs cannot solve these situations, and I didn't try to
target them.  I will be happy enough to correctly support file names
consistently encoded in a single encoding that is the value of
file-name-coding-system.  I hope you will agree that having _that_
broken is not good.

>  > Exactly.  Moreover, what you suggest is a large project that won't
>  > happen without a motivated individual.  Given the overall "cannot
>  > happen on POSIX, so it's SEP"
> 
> It can easily happen on POSIX systems, especially with removable media
> or double-booting hosts.

If you look back at this thread, you will see that this is what I
tried to say, but was consistently told that Posix systems have no
such problems "in practice".

> But I don't see why it should be so difficult.  You already have all
> the functions needed to decode byte streams to Lisp strings or
> buffers, and that's the normal mode of operation, no?

Decoding is not a problem, but it hampers efficiency.  There's also an
associated problem that decoding a file name can GC, which is not good for
functions that get 'char *' pointers as arguments.  Therefore, it is
best avoided (although we do use it when we have no choice, e.g., when
we need to produce a file name from a unibyte directory and a
multibyte file name).

> In fact AFAIK the set of programs that use the unibyte feature at
> all is pretty small, and most of those (like Tramp) do so only in
> self-defense.

You are thinking on the wrong level.  The problem rears its ugly head
on the C level, not on the Lisp level.  Functions in dired.c and
fileio.c manipulate file names, assuming it is safe to address
individual bytes even if the file name is in some DBCS encoding.  I
gave one example a few messages ago.




* Re: Multibyte and unibyte file names
  2013-01-26  3:04                 ` Stephen J. Turnbull
  2013-01-26 11:27                   ` Eli Zaretskii
@ 2013-01-26 16:05                   ` Richard Stallman
  2013-01-26 17:57                     ` Stephen J. Turnbull
  2013-01-26 22:16                     ` Stefan Monnier
  1 sibling, 2 replies; 48+ messages in thread
From: Richard Stallman @ 2013-01-26 16:05 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: eliz, kzhr, michael.albinus, monnier, emacs-devel

Removing unibyte mode could probably be a big slowdown for visiting
binary files, and might make it unreliable.  This would need to be
checked.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call





* Re: Multibyte and unibyte file names
  2013-01-26 13:36                       ` Eli Zaretskii
@ 2013-01-26 16:26                         ` Paul Eggert
  2013-01-26 18:30                           ` Stephen J. Turnbull
  2013-01-26 17:10                         ` Stephen J. Turnbull
  1 sibling, 1 reply; 48+ messages in thread
From: Paul Eggert @ 2013-01-26 16:26 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Stephen J. Turnbull, emacs-devel

>> >  > what you suggest is a large project that won't
>> >  > happen without a motivated individual.  Given the overall "cannot
>> >  > happen on POSIX, so it's SEP"
>> > 
>> > It can easily happen on POSIX systems, especially with removable media
>> > or double-booting hosts.
> If you look back at this thread, you will see that this is what I
> tried to say, but was consistently told that Posix systems have no
> such problems "in practice".

I don't think Stephen and I were talking about the same thing.
Stephen's reference to mojibake was talking about having various
files scattered around the system, with file names using different
encodings, and that can easily happen on POSIX systems.  But as I
understand it, we're not trying to solve that problem -- Emacs
will see mojibake in that situation, and users will just have to
deal with it.

Regardless of whether the mojibake problem is present,
Emacs is OK on a POSIX system without worrying about
this issue, since file names are safe even if they're
encoded in Shift-JIS or Big5.  Moreover, a file name is
safe even if it has some parts encoded in Shift-JIS
and other parts encoded in Big5, so that it looks like
gibberish on the screen.  That is because none of these
encodings usurp '/' or '.'.





* Re: Multibyte and unibyte file names
  2013-01-26 13:36                       ` Eli Zaretskii
  2013-01-26 16:26                         ` Paul Eggert
@ 2013-01-26 17:10                         ` Stephen J. Turnbull
  2013-01-26 17:33                           ` Eli Zaretskii
  1 sibling, 1 reply; 48+ messages in thread
From: Stephen J. Turnbull @ 2013-01-26 17:10 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

I have to say I'm depressed: it is indeed sounding like a fair amount
of work, even without trying to get rid of the root cause.

Eli Zaretskii writes:

 > > My preferred flavor of Emacs never had unibyte.  It's got its problems
 > > in this area, but they're just lazy or over-ambitious programmer bugs,
 > > not a design flaw.
 > 
 > I can't reason about something I know nothing about.  So this is not a
 > useful argument.

Sure it is.  XEmacs is a pretty good facsimile of Emacs-compatibility;
the regular howls from people who want to support XEmacs when Emacs
does something to break compatibility are proof of that.  Nevertheless,
we've never needed unibyte, and our *-as-unibyte functions are no-ops,
and nobody has ever complained about that (a fact that remains
somewhat surprising to me).

 > > Of course.  In fact, pretty much all interaction with the outside
 > > world involves byte streams.  The problem Emacs is experiencing here
 > > is that Lisp can see bytes when it is designed only to work with
 > > characters.
 > 
 > In GNU Emacs, Lisp can work with bytes as well.

Not very well, historically (\207 bug, the expand-file-name bug Stefan
mentioned).  Nothing to be ashamed of at the counting bugs level:
dealing with the bytes/unicode split has cost Python a huge amount of
effort, and many bugs.  But it was unnecessary in the first place in
Emacs.

 > That's OK.  Emacs cannot solve these situations, and I didn't try to
 > target them.  I will be happy enough to correctly support file names
 > consistently encoded in a single encoding that is the value of
 > file-name-coding-system.  I hope you will agree that having _that_
 > broken is not good.

It's horrible.  I'm just saying that it might very well be worth
biting the bullet and eliminating unibyte instead of trying to patch
up a fundamentally poor design.  Or at least bypass unibyte for these
functions.

 > If you look back at this thread, you will see that this is what I
 > tried to say, but was consistently told that Posix systems have no
 > such problems "in practice".

Your informants evidently don't live in Japan.  In practice it's only
a problem if you need to deal with Shift JIS (cp932), such as on a
thumb drive or SMB mount (ISTR for CIFS Samba uses Unicode somehow
nowadays).  Nobody even thinks about using 7-bit JIS etc; POSIX
systems use either UTF-8 or EUC-JP (which you may recall is
ASCII-compatible, and uses only high-bit-set bytes for Japanese).  I
imagine there are similar issues for some subset of Chinese due to
Big5.

It *is* true that such issues are becoming rarer (but Shift JIS
incompatibility is a monthly annoyance for me because of a broken FTP
server I have to deal with).

 > Decoding is not a problem, but it hampers efficiency.

I'm sorry, but that's, uh, "premature optimization".  If Emacs were a
p-language, you'd have a wooden leg to stand on.[1]  But it's not.
People do not write byte-shoveling applications in Emacs Lisp.  They
do write text-shoveling applications, but to be correct those require
atomic characters, so you need to convert anyway.

 > There's also an associated problem that decoding a file can GC,
 > which is not good for functions that get 'char *' pointers as
 > arguments.

So never give them a char* into a Lisp_String, or inhibit GC when you
do.  But strncpy is plenty fast for this application[2], one hell of a
lot faster than the system calls you make to access a filesystem.
Even strndup is fast enough in our experience.
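
A minimal sketch of the copy-first approach, assuming the caller knows
the byte length of the name.  copy_file_name is a hypothetical helper,
not an Emacs or XEmacs function, and it uses memcpy with an explicit
length check rather than strncpy.  The PATH_MAX fallback value is also
an assumption, for systems whose <limits.h> does not define it.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#ifndef PATH_MAX
# define PATH_MAX 4096          /* assumed fallback value */
#endif

/* Copy the LEN bytes at SRC into DST and NUL-terminate the copy.
   DST must have room for PATH_MAX + 1 bytes.  Return 0 on success,
   -1 if the name does not fit.  The copy stays valid even if the
   object SRC pointed into is later moved or freed.  */
static int
copy_file_name (char *dst, const char *src, size_t len)
{
  assert (dst != NULL && src != NULL);
  if (len > PATH_MAX)
    return -1;
  memcpy (dst, src, len);
  dst[len] = '\0';
  return 0;
}
```

The caller would declare `char buf[PATH_MAX + 1];` on the stack, copy
the name into it, and pass `buf` to the system call.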

 > > In fact AFAIK the set of programs that use the unibyte feature at
 > > all is pretty small, and most of those (like Tramp) do so only in
 > > self-defense.
 > 
 > You are thinking on the wrong level.  The problem rears its ugly head
 > on the C level, not on the Lisp level.  Functions in dired.c and
 > fileio.c manipulate file names, assuming it is safe to address
 > individual bytes even if the file name is in some DBCS encoding.

And that's not mediated by Lisp?  I would be surprised if you find any
code paths involving dired that grab a filename from the system, pass
it to a manipulation function, and then try to access the file without
ever storing it in a Lisp object.[3][4]

Footnotes:
[1]  There's plenty of evidence that converting unibyte strings to
Unicode (widechar) in Python 3 doesn't hurt anything but the feelings
of people who assume it's costly but don't benchmark.

[2]  You know that the buffer size is at most PATH_MAX + 1.

[3]  Except for very early in initialization of the interpreter, when
Emacs is still finding pieces of itself.

[4]  Indeed those were among the earliest files to be fully Mule-ized
in XEmacs, which in XEmacs means that textual data received from
outside of XEmacs is immediately converted to internal representation,
and only converted back to external representation immediately before
the system library call or kernel call that consumes it.




* Re: Multibyte and unibyte file names
  2013-01-26 17:10                         ` Stephen J. Turnbull
@ 2013-01-26 17:33                           ` Eli Zaretskii
  2013-01-26 18:06                             ` Paul Eggert
                                               ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Eli Zaretskii @ 2013-01-26 17:33 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: emacs-devel@gnu.org
> Date: Sun, 27 Jan 2013 02:10:54 +0900
> 
>  > > My preferred flavor of Emacs never had unibyte.  It's got its problems
>  > > in this area, but they're just lazy or over-ambitious programmer bugs,
>  > > not a design flaw.
>  > 
>  > I can't reason about something I know nothing about.  So this is not a
>  > useful argument.
> 
> Sure it is.  XEmacs is a pretty good facsimile of Emacs-compatibility;
> the regular howls from people who want to support XEmacs when Emacs
> does something to break compability are proof of that.  Nevertheless,
> we've never needed unibyte, and our *-as-unibyte functions are no-ops,
> and nobody has ever complained about that (a fact that remains
> somewhat surprising to me).

Every solution of a problem has its downsides and its upsides.  I'm
saying that I cannot consider them in this case and therefore cannot
tell you whether on balance it is better than what Emacs does now.

>  > > Of course.  In fact, pretty much all interaction with the outside
>  > > world involves byte streams.  The problem Emacs is experiencing here
>  > > is that Lisp can see bytes when it is designed only to work with
>  > > characters.
>  > 
>  > In GNU Emacs, Lisp can work with bytes as well.
> 
> Not very well, historically (\207 bug, the expand-file-name bug Stefan
> mentioned).  Nothing to be ashamed of at the counting bugs level:
> dealing with the bytes/unicode split has cost Python a huge amount of
> effort, and many bugs.  But it was unnecessary in the first place in
> Emacs.

It _is_ necessary because file names passed to system APIs _must_ be
encoded.  That's where the bugs mentioned here (already fixed, btw)
happen: in the implementation of 'stat' we have in Emacs that does a
better job than the MS runtime, and in other similar cases.

>  > Decoding is not a problem, but it hampers efficiency.
> 
> I'm sorry, but that's, uh, "premature optimization".

It's not premature.  directory-files-and-attributes, used on Windows
to emulate 'ls', must be fast enough even in large directories,
because otherwise Dired will be painfully slow to start.  As things
are, things are too slow already, especially with remote filesystems;
there were bug reports about this last year.  IOW, the current
implementation is already borderline performance-wise.

>  > There's also an associated problem that decoding a file can GC,
>  > which is not good for functions that get 'char *' pointers as
>  > arguments.
> 
> So never give them a char* into a Lisp_String, or inhibit GC when you
> do.  But strncpy is plenty fast for this application[2], one hell of a
> lot faster than the system calls you make to access a filesystem.
> Even strndup is fast enough in our experience.

It's not rocket science, true.  I'm just saying that if it isn't
required, it's best avoided.

>  > > In fact AFAIK the set of programs that use the unibyte feature at
>  > > all is pretty small, and most of those (like Tramp) do so only in
>  > > self-defense.
>  > 
>  > You are thinking on the wrong level.  The problem rears its ugly head
>  > on the C level, not on the Lisp level.  Functions in dired.c and
>  > fileio.c manipulate file names, assuming it is safe to address
>  > individual bytes even if the file name is in some DBCS encoding.
> 
> And that's not mediated by Lisp?  I would be surprised if you find any
> code paths involving dired that grab a filename from the system, pass
> it to a manipulation function, and then try to access the file without
> ever storing it in a Lisp object.[3][4]

I gave examples in this thread that should make you surprised.

In any case, as long as file-name primitives support unibyte (encoded)
file names, there's nothing to prevent such examples from popping up.
Programmers are not disciplined enough to trust them on this.

> [4]  Indeed those were among the earliest files to be fully Mule-ized
> in XEmacs, which in XEmacs means that textual data received from
> outside of XEmacs is immediately converted to internal representation,
> and only converted back to external representation immediately before
> the system library call or kernel call that consumes it.

No such coding standards in Emacs, and the C code does manipulate
unibyte strings as long as they don't need to be passed to Lisp.  I
suggested converting to internal representation at entry to all
primitives in this thread, but it looks like Stefan disagrees, or at
least does not completely agree.




* Re: Multibyte and unibyte file names
  2013-01-26 16:05                   ` Richard Stallman
@ 2013-01-26 17:57                     ` Stephen J. Turnbull
  2013-01-26 22:16                     ` Stefan Monnier
  1 sibling, 0 replies; 48+ messages in thread
From: Stephen J. Turnbull @ 2013-01-26 17:57 UTC (permalink / raw)
  To: rms; +Cc: eliz, kzhr, michael.albinus, monnier, emacs-devel

Richard Stallman writes:

 > Removing unibyte mode could probably be a big slowdown for visiting
 > binary files,

This was benchmarked for XEmacs in the early 2000s, and for files big
enough to matter the decoding time is swamped by disk I/O.  If it's
possible to mmap files into buffers, the difference to visiting might
be perceptible, but even SSDs can't transfer fast enough to beat CPUs
at decoding if you actually fill the buffer.

The biggest difference would be in `goto-char', which becomes
O(distance from the nearest point in the position cache).  It would be
possible to initialize the cache at read time, and a cache 1% of the
size of the file would allow you to cache a position every 2KB or so,
for effectively O(1) performance.[1]  We never implemented that, though;
the most important uses for large files were log files, and all of the
people who were using Emacsen to read log files had pure ASCII files
which did allow random access.
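
A minimal sketch of such a position cache.  It is not XEmacs code: it
assumes UTF-8 text rather than the Mule internal representation, it
caches one byte offset every `stride` characters (not every 2KB of
bytes), and all names are hypothetical.  Because the entries are
uniformly spaced in characters, the lookup indexes the cache directly
and then scans at most `stride` characters forward.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical cache: byte_offset[k] is the byte offset of
   character number k * stride in a UTF-8 text.  */
struct pos_cache
{
  size_t stride;
  size_t nchars;
  size_t *byte_offset;
};

/* In UTF-8, continuation bytes have the form 10xxxxxx.  */
static int
utf8_char_start_p (unsigned char c)
{
  return (c & 0xC0) != 0x80;
}

/* Build the cache in one pass.  Return 0, or -1 if out of memory.  */
static int
pos_cache_build (struct pos_cache *pc, const unsigned char *text,
                 size_t nbytes, size_t stride)
{
  assert (pc != NULL && text != NULL && stride > 0);
  size_t nentries = 0, charpos = 0;
  /* There are at most NBYTES characters, so this is enough room.  */
  pc->byte_offset
    = malloc ((nbytes / stride + 1) * sizeof *pc->byte_offset);
  if (pc->byte_offset == NULL)
    return -1;
  for (size_t i = 0; i < nbytes; i++)
    if (utf8_char_start_p (text[i]))
      {
        if (charpos % stride == 0)
          pc->byte_offset[nentries++] = i;
        charpos++;
      }
  pc->stride = stride;
  pc->nchars = charpos;
  return 0;
}

/* Convert character position CHARPOS to a byte offset.  */
static size_t
pos_cache_char_to_byte (const struct pos_cache *pc,
                        const unsigned char *text, size_t nbytes,
                        size_t charpos)
{
  assert (pc != NULL && text != NULL && charpos <= pc->nchars);
  if (charpos == pc->nchars)
    return nbytes;
  size_t k = charpos / pc->stride;
  size_t i = pc->byte_offset[k];
  /* Walk forward from the cached position, one character at a time.  */
  for (size_t left = charpos - k * pc->stride; left > 0; left--)
    do
      i++;
    while (i < nbytes && !utf8_char_start_p (text[i]));
  return i;
}
```

The caller releases the cache with free (pc.byte_offset).  A real
implementation would also have to update the cache when the buffer is
edited, which this sketch ignores.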

 > and might make it unreliable.

Depends on how you implement it.  XEmacs's implementation has never
had a bug that I've heard of.


Footnotes: 
[1]  Since you know the cache is "nearly uniform", you can do better
than O(log(0.0005*size)).





* Re: Multibyte and unibyte file names
  2013-01-26 17:33                           ` Eli Zaretskii
@ 2013-01-26 18:06                             ` Paul Eggert
  2013-01-26 18:20                               ` Eli Zaretskii
  2013-01-26 18:56                             ` Stephen J. Turnbull
  2013-01-26 21:44                             ` Stefan Monnier
  2 siblings, 1 reply; 48+ messages in thread
From: Paul Eggert @ 2013-01-26 18:06 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Stephen J. Turnbull, emacs-devel

On 01/26/2013 09:33 AM, Eli Zaretskii wrote:
> directory-files-and-attributes, used on Windows
> to emulate 'ls', must be fast enough even in large directories,
> because otherwise Dired will be painfully slow to start.  As things
> are, things are too slow already, especially with remote filesystems;
> there were bug reports about this last year.

Is it possible that those performance problems are due to networking,
or due to the OS overhead of repeatedly interpreting long file names
(addressed for POSIXish systems by the proposed patch in Bug#13539),
rather than due to the overhead of decoding file names retrieved
from directories?




* Re: Multibyte and unibyte file names
  2013-01-26 18:06                             ` Paul Eggert
@ 2013-01-26 18:20                               ` Eli Zaretskii
  0 siblings, 0 replies; 48+ messages in thread
From: Eli Zaretskii @ 2013-01-26 18:20 UTC (permalink / raw)
  To: Paul Eggert; +Cc: stephen, emacs-devel

> Date: Sat, 26 Jan 2013 10:06:26 -0800
> From: Paul Eggert <eggert@cs.ucla.edu>
> CC: "Stephen J. Turnbull" <stephen@xemacs.org>, emacs-devel@gnu.org
> 
> On 01/26/2013 09:33 AM, Eli Zaretskii wrote:
> > directory-files-and-attributes, used on Windows
> > to emulate 'ls', must be fast enough even in large directories,
> > because otherwise Dired will be painfully slow to start.  As things
> > are, things are too slow already, especially with remote filesystems;
> > there were bug reports about this last year.
> 
> Is it possible that those performance problems are due to networking,
> or due to the OS overhead of repeatedly interpreting long file names
> (addressed for POSIXish systems by the proposed patch in Bug#13539),
> rather than due to the overhead of decoding file names retrieved
> from directories?

Sorry, I didn't mean to say that the performance problems are due to
decoding.  I meant to say that performance sometimes sucks even
without adding more decoding.

The performance complaints I heard last were due to retrieving owner
and group information from remote files.  That problem was hopefully
solved in trunk revision 111226.




* Re: Multibyte and unibyte file names
  2013-01-26 16:26                         ` Paul Eggert
@ 2013-01-26 18:30                           ` Stephen J. Turnbull
  0 siblings, 0 replies; 48+ messages in thread
From: Stephen J. Turnbull @ 2013-01-26 18:30 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Eli Zaretskii, emacs-devel

Paul Eggert writes:

 > encoded in Shift-JIS or Big5.  Moreover, a file name is
 > safe even if it has some parts encoded in Shift-JIS
 > and other parts encoded in Big5, so that it looks like
 > gibberish on the screen.  That is because none of these
 > encoding usurp '/' or '.'.

They do however usurp '\', which has meaning in regexps and to shells,
as well as some other characters that have meaning in those contexts.
So you're OK as long as the filename is passed directly to the kernel,
but not if it's used in many other external contexts.

It also may imply that some unibyte strings cannot be correctly
regexp-quoted when used in Emacs (because some of the backslashes
arise from encoded characters in file names, and others were
introduced by the user via regexp constructs).

So it's not so easy to escape by noting that POSIX file names as such
are safe.





* Re: Multibyte and unibyte file names
  2013-01-26 17:33                           ` Eli Zaretskii
  2013-01-26 18:06                             ` Paul Eggert
@ 2013-01-26 18:56                             ` Stephen J. Turnbull
  2013-01-26 21:40                               ` Stefan Monnier
  2013-01-26 21:44                             ` Stefan Monnier
  2 siblings, 1 reply; 48+ messages in thread
From: Stephen J. Turnbull @ 2013-01-26 18:56 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii writes:

 > > But it was unnecessary in the first place in Emacs.
 > 
 > It _is_ necessary because file names passed to system APIs _must_ be
 > encoded.  That's where the bugs mentioned here (already fixed, btw)
 > happen: in the implementation of 'stat' we have in Emacs that does a
 > better job than the MS runtime, and in other similar cases.

Of course they need to be encoded file names.  The question is, were
they stored in encoded form, or were they encoded just before passing
them to the OS?  I'm recommending the latter strategy.

 > It's not premature.  directory-files-and-attributes, used on Windows
 > to emulate 'ls', must be fast enough even in large directories,
 > because otherwise Dired will be painfully slow to start.

But encoding and decoding can't add to those performance problems,
because Dired must do the decoding anyway to fill the buffer, and
encoding happens once and is very short.

 > It's not rocket science, true.  I'm just saying that if it isn't
 > required, it's best avoided.

Hey!  That's precisely my point about unibyte.

 > In any case, as long as file-name primitives support unibyte (encoded)
 > file names, there's nothing to prevent such examples from popping up.

True.  They *never* happen in XEmacs though.  They *can't*.  Functions
which manipulate text in XEmacs work on internal representation, never
in encoded form.  That's just too risky, and demands too many special
cases to be reliable.

But I guess you just don't have enough manpower to change this.





* Re: Multibyte and unibyte file names
  2013-01-26 18:56                             ` Stephen J. Turnbull
@ 2013-01-26 21:40                               ` Stefan Monnier
  0 siblings, 0 replies; 48+ messages in thread
From: Stefan Monnier @ 2013-01-26 21:40 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Eli Zaretskii, emacs-devel

> Of course they need to be encoded file names.  The question is, were
> they stored in encoded form, or were they encoded just before passing
> them to the OS?

Elisp should basically never see encoded file names.  The problems
we're discussing are in the C code.

> True.  They *never* happen in XEmacs though.  They *can't*.

They can happen in Emacs, but we don't care about them.


        Stefan




* Re: Multibyte and unibyte file names
  2013-01-26 17:33                           ` Eli Zaretskii
  2013-01-26 18:06                             ` Paul Eggert
  2013-01-26 18:56                             ` Stephen J. Turnbull
@ 2013-01-26 21:44                             ` Stefan Monnier
  2013-01-27  6:14                               ` Eli Zaretskii
  2 siblings, 1 reply; 48+ messages in thread
From: Stefan Monnier @ 2013-01-26 21:44 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Stephen J. Turnbull, emacs-devel

> No such coding standards in Emacs, and the C code does manipulate
> unibyte strings as long as they don't need to be passed to Lisp.
> I suggested converting to internal representation at entry to all
> primitives in this thread, but it looks like Stefan disagrees, or at
> least not completely agrees.

Indeed, I may not completely agree, but I think I don't really know what
your suggestion is, because "entry to all primitives" is too vague (I
obviously misunderstood it at first, and even now that I know that it
doesn't mean what I thought it meant, I still don't really know what it
means).


        Stefan




* Re: Multibyte and unibyte file names
  2013-01-26 13:16                       ` Eli Zaretskii
@ 2013-01-26 22:11                         ` Stefan Monnier
  2013-01-27  7:03                           ` Eli Zaretskii
  0 siblings, 1 reply; 48+ messages in thread
From: Stefan Monnier @ 2013-01-26 22:11 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: kzhr, michael.albinus, emacs-devel

> OK, but as long as file-name primitives are required to support
> unibyte strings, you cannot be sure these situations won't pop up in
> the future.

I don't see a need to disallow unibyte strings, but I don't see the need
to be particularly careful about it either.  Basically Elisp code which
provides unibyte file names does so at its own risk.

>> I think the right thing to do with unibyte file names is to treat them
>> as a sequence of bytes, not a sequence of encoded chars.  If the caller
>> doesn't like it, then she should pass a decoded file name instead.
> This effectively means we don't support them _as_file_names_.
> Because, e.g., testing individual bytes for equality to something like
> '\\' can trip on multibyte (DBCS) encodings if the trailing byte
> happens to be '\\'.  In general, it isn't "safe" to iterate over these
> strings one byte at a time.

But that's exactly the behavior stipulated by POSIX (tho for '/' rather
than '\\').  I.e. if you use file names on a POSIX host with
a coding-system that occasionally uses '/' within its multibyte
sequences, you'll get those surprises regardless of Emacs.  And for that
reason, Emacs would be right to cut those file names in the middle of
a multibyte sequence.

IIUC that's what makes this a "w32-only problem", because the w32
semantics for file names is based on characters, so a '\\' (or a '/')
appearing within a multibyte sequence is not considered by the OS as
a separator.

And since Emacs is largely based on "POSIX semantics for the generic
code, plus an emulation layer in w32.c", we have a problem of subtly
incompatible semantics.

>> > Are you saying that since this happens
>> > infrequently, we could process such file names in a broken way,
>> Right.
> He, I don't think this will be well accepted.

I haven't heard too many screams about this over the years.

>> > e.g. finding a directory separator where there's none, as demonstrated
>> > in http://debbugs.gnu.org/cgi/bugreport.cgi?bug=13515#5?
>> That seems like a real bug, tho:
> Of course, it's a real bug!  This is what will happen, at least on
> Windows, if we decide not to pay attention to encoded file names.
> Which is what we do now, in many places.

>> (let ((file-name-coding-system 'cp932))
>> (expand-file-name "表" "C:/"))
>> should not return "c:/\225/".  Why does it even pay attention to
>> file-name-coding-system?
> Because it encodes the file name it passes to dostounix_filename.

Why? [ OK, I see the answer follows.. ]

> And it does that because dostounix_filename needs optionally to
> downcase the name (when w32-downcase-file-names is set).

Hmm.. but downcasing is an operation on chars, not on bytes, so it
should be applied to decoded names, right?

> The way dostounix_filename downcases file names depends on the current
> locale, so it must get encoded file names.

Are you saying that the "downcase" function is not Emacs's own but is
a function provided by the OS, so we need to encode the name to pass it
to that function?  If so, we need to immediately decode the result.
(and of course this encode+downcase+decode is only done if
w32-downcase-file-names is set).

Alternatively, we could use Emacs's own downcasing function, which does
not depend on the locale and operates directly on decoded names.

> If p points to an encoded file name, we could think we found a
> backslash in p[-1], where in fact it's a trailing byte of a multibyte
> sequence.  And this is just an example; just search for
> IS_DIRECTORY_SEP and you will find quite a bit more.

As explained elsewhere such "spurious directory separator within
a multibyte char" has a different meaning under w32 than under POSIX.
The current Emacs code is correct in this respect under POSIX (as odd as
it may sound).

Luckily the problem should only appear if such code is run on unibyte
names and that should be rare enough (in the generic part of the C code)
that we don't need to worry about it.

But indeed for uses of IS_DIRECTORY_SEP in w32.c this is probably more
serious since those functions emulate POSIX calls, so they always receive
encoded file names.

> UTF-8 precludes them.  Thus my question whether we want to support
> encoded file names in these primitives as first-class citizens.

Could you specify a bit more precisely which primitives you have
in mind?


        Stefan




* Re: Multibyte and unibyte file names
  2013-01-26 16:05                   ` Richard Stallman
  2013-01-26 17:57                     ` Stephen J. Turnbull
@ 2013-01-26 22:16                     ` Stefan Monnier
  1 sibling, 0 replies; 48+ messages in thread
From: Stefan Monnier @ 2013-01-26 22:16 UTC (permalink / raw)
  To: rms; +Cc: Stephen J. Turnbull, kzhr, michael.albinus, eliz, emacs-devel

> Removing unibyte mode could probably be a big slowdown for visiting
> binary files, and might make it unreliable.

We do have some performance bugs in some related operations that we
might need to fix if we wanted to go down that way (IIRC, converting
a big unibyte buffer to multibyte, with many bytes >127, takes time
O(N^2)).

> This would need to be checked.

I don't see the need.  The problems we're facing are unrelated.


        Stefan




* Re: Multibyte and unibyte file names
  2013-01-26 21:44                             ` Stefan Monnier
@ 2013-01-27  6:14                               ` Eli Zaretskii
  0 siblings, 0 replies; 48+ messages in thread
From: Eli Zaretskii @ 2013-01-27  6:14 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: stephen, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: "Stephen J. Turnbull" <stephen@xemacs.org>,  emacs-devel@gnu.org
> Date: Sat, 26 Jan 2013 16:44:55 -0500
> 
> > No such coding standards in Emacs, and the C code does manipulate
> > unibyte strings as long as they don't need to be passed to Lisp.
> > I suggested converting to internal representation at entry to all
> > primitives in this thread, but it looks like Stefan disagrees, or at
> > least not completely agrees.
> 
> Indeed, I may not completely agree, but I think I don't really know what
> is your suggestion because "entry to all primitives" is too vague (I
> obviously misunderstood it at first, and even now that I know that it
> doesn't mean what I thought it meant, I still don't really know what it
> means).

It means this:

DEFUN ("file-name-directory", Ffile_name_directory, Sfile_name_directory,
       1, 1, 0,
       doc: /* Return the directory component in file name FILENAME.
Return nil if FILENAME does not include a directory.
Otherwise return a directory name.
Given a Unix syntax file name, returns a string ending in slash.  */)
  (Lisp_Object filename)
{
#ifndef DOS_NT
  register const char *beg;
#else
  register char *beg;
  Lisp_Object tem_fn;
#endif
  register const char *p;
  Lisp_Object handler;

  CHECK_STRING (filename);
  if (!STRING_MULTIBYTE (filename))      <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    filename = DECODE_FILE (filename);   <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<




* Re: Multibyte and unibyte file names
  2013-01-26 22:11                         ` Stefan Monnier
@ 2013-01-27  7:03                           ` Eli Zaretskii
  2013-01-27  8:46                             ` Andreas Schwab
  2013-01-28  1:55                             ` Stefan Monnier
  0 siblings, 2 replies; 48+ messages in thread
From: Eli Zaretskii @ 2013-01-27  7:03 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: kzhr, michael.albinus, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org,  kzhr@d1.dion.ne.jp,  michael.albinus@gmx.de
> Date: Sat, 26 Jan 2013 17:11:25 -0500
> 
> > OK, but as long as file-name primitives are required to support
> > unibyte strings, you cannot be sure these situations won't pop up in
> > the future.
> 
> I don't see a need to disallow unibyte strings, but I don't see the need
> to be particularly careful about it either.  Basically Elisp code which
> provides unibyte file names does it at its own risks.

What about C code that calls these primitives?  Can we consider every
such instance a bug in the caller?  If so, we could stop catering to
unibyte strings in these primitives, which will make at least some of
them a whole lot simpler.

> >> I think the right thing to do with unibyte file names is to treat them
> >> as a sequence of bytes, not a sequence of encoded chars.  If the caller
> >> doesn't like it, then she should pass a decoded file name instead.
> > This effectively means we don't support them _as_file_names_.
> > Because, e.g., testing individual bytes for equality to something like
> > '\\' can trip on multibyte (DBCS) encodings if the trailing byte
> > happens to be '\\'.  In general, it isn't "safe" to iterate over these
> > strings one byte at a time.
> 
> But that's exactly the behavior stipulated by POSIX (tho for '/' rather
> than '\\').  I.e. if you use file names on a POSIX host with
> a coding-system that occasionally uses '/' within its multibyte
> sequences, you'll get those surprises regardless of Emacs.  And for that
> reason, Emacs would be right to cut those file names in the middle of
> a multibyte sequence.

Then why did you regard this:

 (let ((file-name-coding-system 'cp932))
   (expand-file-name "表" "C:/"))

  => "c:/\225/"

as a bug?  This is exactly what happens there: the string "表", when
encoded with cp932, has '\' as its last byte.

> IIUC that's what makes this a "w32-only problem", because the w32
> semantics for file names is based on characters, so a '\\' (or a '/')
> appearing within a multibyte sequence is not considered by the OS as
> a separator.
> 
> And since Emacs is largely based on "POSIX semantics for the generic
> code, plus an emulation layer in w32.c", we have a problem of subtly
> incompatible semantics.

Maybe so, but it certainly isn't the only place in Emacs with subtly
incompatible semantics.  And anyway, I don't see how this observation
helps to decide what, if anything, to do to fix this.

> >> > Are you saying that since this happens
> >> > infrequently, we could process such file names in a broken way,
> >> Right.
> > He, I don't think this will be well accepted.
> 
> I haven't heard too many screams about this over the years.

I heard 2 this week, from 2 different users.  Inability to reference
file names that are allowed by the underlying filesystem is a bad bug,
IMO.

> > And it does that because dostounix_filename needs optionally to
> > downcase the name (when w32-downcase-file-names is set).
> 
> Hmm.. but downcasing is an operation on chars, not on bytes, so it
> should be applied to decoded names, right?

That's not how the code was written.  w32.c functions get the strings
that are already encoded.

> > The way dostounix_filename downcases file names depends on the current
> > locale, so it must get encoded file names.
> 
> Are you saying that the "downcase" function is not Emacs's own but is
> a function provided by the OS, so we need to encode the name to pass it
> to that function?

That's how the code works, yes.

> If so, we need to immediately decode the result.

We already do.  Example:

  else if (STRING_MULTIBYTE (filename))
    {
      tem_fn = ENCODE_FILE (make_specified_string (beg, -1, p - beg, 1));
      dostounix_filename (SSDATA (tem_fn));
      tem_fn = DECODE_FILE (tem_fn);
    }

> (and of course this encode+downcase+decode is only done if
> w32-downcase-file-names is set).

Can't do that, because dostounix_filename also mirrors the backslashes
and downcases the drive letter -- independently of
w32-downcase-file-names.  Since dostounix_filename currently operates
only on encoded file names, the above is always done for decoded file
names.

> Alternatively, we could use Emacs's own downcasing function, which does
> not depend on the locale and operates directly on decoded names.

That's what I intend to do, indeed, once the dust settles on this
discussion, and I understand the requirements.

Note that using Emacs's downcase is not a trivial change, because
(AFAIK) accessing the downcase_table can trigger GC.  Also, downcasing
might change the byte count of a multibyte string (due to
unification), so we cannot pass a 'char *' to dostounix_filename.  Not
rocket science, of course, but still...

Alternatively, we could downcase inline in the primitives themselves,
not inside dostounix_filename.

> But indeed for uses of IS_DIRECTORY_SEP in w32.c this is probably more
> serious since those functions emulate POSIX calls, so they always receive
> encoded file names.

I think I already fixed all of them.

> > UTF-8 precludes them.  Thus my question whether we want to support
> > encoded file names in these primitives as first-class citizens.
> 
> Could you specify a bit more precisely which primitives you have
> in mind?

Those in fileio.c and in dired.c.  I could give an explicit list, if
you want.





* Re: Multibyte and unibyte file names
  2013-01-27  7:03                           ` Eli Zaretskii
@ 2013-01-27  8:46                             ` Andreas Schwab
  2013-01-27  9:40                               ` Eli Zaretskii
  2013-01-28  1:55                             ` Stefan Monnier
  1 sibling, 1 reply; 48+ messages in thread
From: Andreas Schwab @ 2013-01-27  8:46 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: kzhr, michael.albinus, Stefan Monnier, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> What about C code that calls these primitives?

It doesn't matter who calls them, they operate on decoded file names.

> If so, we could stop catering to unibyte strings in these primitives,

A pure ASCII string is unibyte by default, but that doesn't mean it is
an encoded string.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."




* Re: Multibyte and unibyte file names
  2013-01-27  8:46                             ` Andreas Schwab
@ 2013-01-27  9:40                               ` Eli Zaretskii
  0 siblings, 0 replies; 48+ messages in thread
From: Eli Zaretskii @ 2013-01-27  9:40 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: kzhr, michael.albinus, monnier, emacs-devel

> From: Andreas Schwab <schwab@linux-m68k.org>
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,  kzhr@d1.dion.ne.jp,  michael.albinus@gmx.de,  emacs-devel@gnu.org
> Date: Sun, 27 Jan 2013 09:46:29 +0100
> 
> > If so, we could stop catering to unibyte strings in these primitives,
> 
> A pure ASCII string is unibyte by default, but that doesn't mean it is
> an encoded string.

Yes, I meant encoded, not unibyte pure-ASCII.




* Re: Multibyte and unibyte file names
  2013-01-27  7:03                           ` Eli Zaretskii
  2013-01-27  8:46                             ` Andreas Schwab
@ 2013-01-28  1:55                             ` Stefan Monnier
  2013-01-28 14:44                               ` Eli Zaretskii
  1 sibling, 1 reply; 48+ messages in thread
From: Stefan Monnier @ 2013-01-28  1:55 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: kzhr, michael.albinus, emacs-devel

>> > OK, but as long as file-name primitives are required to support
>> > unibyte strings, you cannot be sure these situations won't pop up in
>> > the future.
>> I don't see a need to disallow unibyte strings, but I don't see the need
>> to be particularly careful about it either.  Basically Elisp code which
>> provides unibyte file names does so at its own risk.
> What about C code that calls these primitives?  Can we consider every
> such instance a bug in the caller?

Most likely, yes.

>> But that's exactly the behavior stipulated by POSIX (tho for '/' rather
>> than '\\').  I.e. if you use file names on a POSIX host with
>> a coding-system that occasionally uses '/' within its multibyte
>> sequences, you'll get those surprises regardless of Emacs.  And for that
>> reason, Emacs would be right to cut those file names in the middle of
>> a multibyte sequence.
> Then why did you regard this:
>  (let ((file-name-coding-system 'cp932))
>    (expand-file-name "表" "C:/"))
>   => "c:/\225/"
> as a bug?

Because expand-file-name works on Emacs strings, not on
file-system strings.

>> And since Emacs is largely based on "POSIX semantics for the generic
>> code, plus an emulation layer in w32.c", we have a problem of subtly
>> incompatible semantics.
> Maybe so, but it certainly isn't the only place in Emacs with subtly
> incompatible semantics.  And anyway, I don't see how this observation
> helps to decide what, if anything, to do to fix this.

It helps me understand the problem, at least.
Maybe it also points out that we might like to change the interface so
that generic code does not encode strings before passing them to the
OS-specific primitives.

>> Could you specify a bit more precisely which primitives you have
>> in mind?
> Those in fileio.c and in dired.c.  I could give an explicit list, if
> you want.

At least I disagree with your Ffile_name_directory suggestion: if the
file-name is already encoded and it results in bugs, the fix should be
in the caller.


        Stefan




* Re: Multibyte and unibyte file names
  2013-01-28  1:55                             ` Stefan Monnier
@ 2013-01-28 14:44                               ` Eli Zaretskii
  2013-01-28 15:21                                 ` Stefan Monnier
  0 siblings, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2013-01-28 14:44 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: kzhr, michael.albinus, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org,  kzhr@d1.dion.ne.jp,  michael.albinus@gmx.de
> Date: Sun, 27 Jan 2013 20:55:16 -0500
> 
> At least I disagree with your Ffile_name_directory suggestion: if the
> file-name is already encoded and it results in bugs, the fix should be
> in the caller.

So you are saying these primitives should assume decoded file names,
and if called with encoded ones, exhibit undefined behavior, for which
the fix is in the caller, is that right?




* Re: Multibyte and unibyte file names
  2013-01-28 14:44                               ` Eli Zaretskii
@ 2013-01-28 15:21                                 ` Stefan Monnier
  2013-02-02 17:19                                   ` Eli Zaretskii
  0 siblings, 1 reply; 48+ messages in thread
From: Stefan Monnier @ 2013-01-28 15:21 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: kzhr, michael.albinus, emacs-devel

>> At least I disagree with your Ffile_name_directory suggestion: if the
>> file-name is already encoded and it results in bugs, the fix should be
>> in the caller.
> So you are saying these primitives should assume decoded file names,
> and if called with encoded ones, exhibit undefined behavior, for which
> the fix is in the caller, is that right?

Pretty much, yes.


        Stefan




* Re: Multibyte and unibyte file names
  2013-01-28 15:21                                 ` Stefan Monnier
@ 2013-02-02 17:19                                   ` Eli Zaretskii
  0 siblings, 0 replies; 48+ messages in thread
From: Eli Zaretskii @ 2013-02-02 17:19 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: kzhr, michael.albinus, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: kzhr@d1.dion.ne.jp,  michael.albinus@gmx.de,  emacs-devel@gnu.org
> Date: Mon, 28 Jan 2013 10:21:52 -0500
> 
> >> At least I disagree with your Ffile_name_directory suggestion: if the
> >> file-name is already encoded and it results in bugs, the fix should be
> >> in the caller.
> > So you are saying these primitives should assume decoded file names,
> > and if called with encoded ones, exhibit undefined behavior, for which
> > the fix is in the caller, is that right?
> 
> Pretty much, yes.

OK, I just committed to trunk a changeset (revision 111663) to do
this.  dostounix_filename now accepts an argument telling it whether
the file name is multibyte (i.e. decoded), and fileio.c primitives use
that.  Downcasing of file names under w32-downcase-file-names is now
done by calling Fdowncase on Lisp strings.

Let's see how much this will break.



