From: Eli Zaretskii <eliz@gnu.org>
To: Stefan Monnier <monnier@iro.umontreal.ca>
Cc: kzhr@d1.dion.ne.jp, michael.albinus@gmx.de, emacs-devel@gnu.org
Subject: Re: Multibyte and unibyte file names
Date: Sun, 27 Jan 2013 09:03:08 +0200
Message-ID: <834ni3jefn.fsf@gnu.org>
In-Reply-To: <jwva9rvmxdn.fsf-monnier+emacs@gnu.org>

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org,  kzhr@d1.dion.ne.jp,  michael.albinus@gmx.de
> Date: Sat, 26 Jan 2013 17:11:25 -0500
> 
> > OK, but as long as file-name primitives are required to support
> > unibyte strings, you cannot be sure these situations won't pop up in
> > the future.
> 
> I don't see a need to disallow unibyte strings, but I don't see the need
> to be particularly careful about it either.  Basically, Elisp code which
> provides unibyte file names does so at its own risk.

What about C code that calls these primitives?  Can we consider every
such instance a bug in the caller?  If so, we could stop catering to
unibyte strings in these primitives, which will make at least some of
them a whole lot simpler.

> >> I think the right thing to do with unibyte file names is to treat them
> >> as a sequence of bytes, not a sequence of encoded chars.  If the caller
> >> doesn't like it, then she should pass a decoded file name instead.
> > This effectively means we don't support them _as_file_names_.
> > Because, e.g., testing individual bytes for equality to something like
> > '\\' can trip on multibyte (DBCS) encodings if the trailing byte
> > happens to be '\\'.  In general, it isn't "safe" to iterate over these
> > strings one byte at a time.
> 
> But that's exactly the behavior stipulated by POSIX (tho for '/' rather
> than '\\').  I.e. if you use file names on a POSIX host with
> a coding-system that occasionally uses '/' within its multibyte
> sequences, you'll get those surprises regardless of Emacs.  And for that
> reason, Emacs would be right to cut those file names in the middle of
> a multibyte sequence.

Then why did you regard this:

 (let ((file-name-coding-system 'cp932))
   (expand-file-name "表" "C:/"))

  => "c:/\225/"

as a bug?  This is exactly what happens there: the string "表", when
encoded with cp932, has '\' as its last byte.
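
To make the failure mode concrete, here is a minimal sketch (not the
Emacs code) that compares a byte-wise search for a directory separator
with one that skips the trail byte of a double-byte character.  The
function names are hypothetical, and the lead-byte ranges 0x81-0x9F and
0xE0-0xFC are my assumption about cp932:

```c
#include <stddef.h>

/* Assumption: cp932 lead bytes lie in 0x81-0x9F and 0xE0-0xFC.  */
static int cp932_lead_byte_p(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Byte-wise scan: index of the first '\\' or '/' byte, or -1.
   This treats the trail byte of a DBCS character as a separator.  */
static long first_sep_bytewise(const char *s)
{
    for (size_t i = 0; s[i] != '\0'; i++)
        if (s[i] == '\\' || s[i] == '/')
            return (long) i;
    return -1;
}

/* Character-wise scan: same search, but a lead byte causes the
   following trail byte to be skipped without being examined.  */
static long first_sep_charwise(const char *s)
{
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (cp932_lead_byte_p((unsigned char) s[i]) && s[i + 1] != '\0') {
            i++;            /* step over the trail byte */
            continue;
        }
        if (s[i] == '\\' || s[i] == '/')
            return (long) i;
    }
    return -1;
}
```

With the two bytes 0x95 0x5C (the cp932 encoding of "表"), the
byte-wise scan reports a separator at index 1, while the
character-wise scan reports none.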

> IIUC that's what makes this a "w32-only problem", because the w32
> semantics for file names is based on characters, so a '\\' (or a '/')
> appearing within a multibyte sequence is not considered by the OS as
> a separator.
> 
> And since Emacs is largely based on "POSIX semantics for the generic
> code, plus an emulation layer in w32.c", we have a problem of subtly
> incompatible semantics.

Maybe so, but it certainly isn't the only place in Emacs with subtly
incompatible semantics.  And anyway, I don't see how this observation
helps to decide what, if anything, to do to fix this.

> >> > Are you saying that since this happens
> >> > infrequently, we could process such file names in a broken way,
> >> Right.
> > Heh, I don't think this will be well received.
> 
> I haven't heard too many screams about this over the years.

I heard two this week, from two different users.  Inability to reference
file names that are allowed by the underlying filesystem is a bad bug,
IMO.

> > And it does that because dostounix_filename needs optionally to
> > downcase the name (when w32-downcase-file-names is set).
> 
> Hmm.. but downcasing is an operation on chars, not on bytes, so it
> should be applied to decoded names, right?

That's not how the code was written.  w32.c functions get the strings
that are already encoded.

> > The way dostounix_filename downcases file names depends on the current
> > locale, so it must get encoded file names.
> 
> Are you saying that the "downcase" function is not Emacs's own but is
> a function provided by the OS, so we need to encode the name to pass it
> to that function?

That's how the code works, yes.

> If so, we need to immediately decode the result.

We already do.  Example:

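  else if (STRING_MULTIBYTE (filename))
    {
      /* Encode the decoded name, since dostounix_filename works on bytes.  */
      tem_fn = ENCODE_FILE (make_specified_string (beg, -1, p - beg, 1));
      /* Convert the encoded bytes in place.  */
      dostounix_filename (SSDATA (tem_fn));
      /* Decode the result back into a multibyte string.  */
      tem_fn = DECODE_FILE (tem_fn);
    }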

> (and of course this encode+downcase+decode is only done if
> w32-downcase-file-names is set).

Can't do that, because dostounix_filename also mirrors the backslashes
and downcases the drive letter, independently of
w32-downcase-file-names.  Since dostounix_filename currently operates
only on encoded file names, the encode/convert/decode sequence above is
always performed for decoded (multibyte) file names.
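
As an illustration only, the following is a simplified stand-in for
those two unconditional steps.  The name sketch_dostounix is
hypothetical and this is not the w32.c implementation; it just shows
why doing the job byte by byte on an encoded name is fragile:

```c
/* Hypothetical, simplified stand-in for the unconditional part of
   dostounix_filename: downcase an ASCII drive letter and turn every
   backslash byte into a forward slash, in place.  It is deliberately
   byte-oriented, so it also rewrites a 0x5C trail byte of a
   double-byte character.  */
static void sketch_dostounix(char *p)
{
    /* Downcase the drive letter, if the name starts with "X:".  */
    if (p[0] >= 'A' && p[0] <= 'Z' && p[1] == ':')
        p[0] = (char) (p[0] + ('a' - 'A'));

    /* Mirror the backslashes.  */
    for (; *p != '\0'; p++)
        if (*p == '\\')
            *p = '/';
}
```

Applied to the bytes of "C:\" followed by 0x95 0x5C, this yields
"c:/" followed by 0x95 and '/', which has the same shape as the
"c:/\225/" result shown earlier.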

> Alternatively, we could use Emacs's own downcasing function, which does
> not depend on the locale and operates directly on decoded names.

That's what I intend to do, indeed, once the dust settles on this
discussion, and I understand the requirements.

Note that using Emacs's downcase is not a trivial change, because
(AFAIK) accessing the downcase_table can trigger GC.  Also, downcasing
might change the byte count of a multibyte string (due to
unification), so we cannot pass a 'char *' to dostounix_filename.  Not
rocket science, of course, but still...
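
To show why an in-place 'char *' interface is not enough, here is a
small sketch that measures how the byte length of a UTF-8-like string
changes under downcasing.  The mapping below is a tiny hypothetical
table with two Unicode simple case mappings, not Emacs's downcase
table, and whether these particular characters are the ones affected
in Emacs is an assumption:

```c
#include <stddef.h>

/* Bytes needed to encode code point C in UTF-8 (C <= 0x10FFFF).  */
static size_t utf8_len(unsigned long c)
{
    if (c < 0x80)
        return 1;
    if (c < 0x800)
        return 2;
    if (c < 0x10000)
        return 3;
    return 4;
}

/* Hypothetical downcase mapping, for illustration only.  */
static unsigned long sketch_downcase(unsigned long c)
{
    if (c >= 'A' && c <= 'Z')
        return c + ('a' - 'A');
    switch (c) {
    case 0x212A: return 0x006B;  /* KELVIN SIGN -> 'k': 3 bytes to 1 */
    case 0x023A: return 0x2C65;  /* A WITH STROKE: 2 bytes to 3 */
    default:     return c;
    }
}

/* Change in encoded byte length when all N code points are downcased.
   A nonzero result means the string cannot be rewritten in place.  */
static long downcase_byte_delta(const unsigned long *cps, size_t n)
{
    long delta = 0;
    for (size_t i = 0; i < n; i++)
        delta += (long) utf8_len(sketch_downcase(cps[i]))
                 - (long) utf8_len(cps[i]);
    return delta;
}
```

A pure-ASCII name gives a delta of zero, but the two non-ASCII
entries above shrink or grow the string, so the caller would need to
allocate a new buffer rather than modify the bytes in place.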

Alternatively, we could downcase inline in the primitives themselves,
not inside dostounix_filename.

> But indeed for uses of IS_DIRECTORY_SEP in w32.c this is probably more
> serious since those functions emulate POSIX calls, so they always receive
> encoded file names.

I think I already fixed all of them.

> > UTF-8 precludes them.  Thus my question whether we want to support
> > encoded file names in these primitives as first-class citizens.
> 
> Could you specify a bit more precisely which primitives you have
> in mind?

Those in fileio.c and in dired.c.  I could give an explicit list, if
you want.




