From: Eli Zaretskii <eliz@gnu.org>
To: Stefan Monnier <monnier@iro.umontreal.ca>
Cc: kzhr@d1.dion.ne.jp, michael.albinus@gmx.de, emacs-devel@gnu.org
Subject: Re: Multibyte and unibyte file names
Date: Sun, 27 Jan 2013 09:03:08 +0200
Message-ID: <834ni3jefn.fsf@gnu.org>
In-Reply-To: <jwva9rvmxdn.fsf-monnier+emacs@gnu.org>
> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org, kzhr@d1.dion.ne.jp, michael.albinus@gmx.de
> Date: Sat, 26 Jan 2013 17:11:25 -0500
>
> > OK, but as long as file-name primitives are required to support
> > unibyte strings, you cannot be sure these situations won't pop up in
> > the future.
>
> I don't see a need to disallow unibyte strings, but I don't see the need
> to be particularly careful about it either. Basically, Elisp code that
> provides unibyte file names does so at its own risk.
What about C code that calls these primitives? Can we consider every
such instance a bug in the caller? If so, we could stop catering to
unibyte strings in these primitives, which will make at least some of
them a whole lot simpler.
> >> I think the right thing to do with unibyte file names is to treat them
> >> as a sequence of bytes, not a sequence of encoded chars. If the caller
> >> doesn't like it, then she should pass a decoded file name instead.
> > This effectively means we don't support them _as_file_names_.
> > Because, e.g., testing individual bytes for equality to something like
> > '\\' can trip on multibyte (DBCS) encodings if the trailing byte
> > happens to be '\\'. In general, it isn't "safe" to iterate over these
> > strings one byte at a time.
>
> But that's exactly the behavior stipulated by POSIX (tho for '/' rather
> than '\\'). I.e. if you use file names on a POSIX host with
> a coding-system that occasionally uses '/' within its multibyte
> sequences, you'll get those surprises regardless of Emacs. And for that
> reason, Emacs would be right to cut those file names in the middle of
> a multibyte sequence.

Then why did you regard this:

  (let ((file-name-coding-system 'cp932))
    (expand-file-name "表" "C:/"))

    => "c:/\225/"

as a bug? This is exactly what happens there: the string "表", when
encoded with cp932, has '\' as its last byte.
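
To make the cp932 point concrete, here is a minimal sketch in C. It is not taken from the Emacs sources: the function names are hypothetical, and the lead-byte ranges (0x81-0x9F and 0xE0-0xFC) are an assumption about cp932 hard-coded for illustration. It contrasts a byte-at-a-time scan for '\\' with one that skips the trail byte of each double-byte character.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: in cp932, bytes 0x81-0x9F and 0xE0-0xFC start a
   two-byte character.  Hypothetical helper, not an Emacs function.  */
static int
sketch_cp932_lead_byte_p (unsigned char c)
{
  return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Count '\\' bytes, treating every byte as a separate character.  */
static size_t
sketch_count_backslashes_bytewise (const unsigned char *s, size_t len)
{
  size_t n = 0;
  assert (s != NULL);
  for (size_t i = 0; i < len; i++)
    if (s[i] == '\\')
      n++;
  return n;
}

/* Count '\\' characters, stepping over the trail byte of each
   two-byte character so that it is never mistaken for a separator.  */
static size_t
sketch_count_backslashes_dbcs (const unsigned char *s, size_t len)
{
  size_t n = 0;
  assert (s != NULL);
  for (size_t i = 0; i < len; i++)
    {
      if (sketch_cp932_lead_byte_p (s[i]) && i + 1 < len)
        i++;                    /* skip the trail byte */
      else if (s[i] == '\\')
        n++;
    }
  return n;
}
```

With "表" encoded as the two bytes 0x95 0x5C, the byte-wise scan reports one backslash and the lead-byte-aware scan reports none, which is the same confusion that produces "c:/\225/" above.
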
> IIUC that's what makes this a "w32-only problem", because the w32
> semantics for file names is based on characters, so a '\\' (or a '/')
> appearing within a multibyte sequence is not considered by the OS as
> a separator.
>
> And since Emacs is largely based on "POSIX semantics for the generic
> code, plus an emulation layer in w32.c", we have a problem of subtly
> incompatible semantics.
Maybe so, but it certainly isn't the only place in Emacs with subtly
incompatible semantics. And anyway, I don't see how this observation
helps to decide what, if anything, to do to fix this.
> >> > Are you saying that since this happens
> >> > infrequently, we could process such file names in a broken way,
> >> Right.
> > Heh, I don't think this will be well accepted.
>
> I haven't heard too many screams about this over the years.
I heard 2 this week, from 2 different users. Inability to reference
file names that are allowed by the underlying filesystem is a bad bug,
IMO.
> > And it does that because dostounix_filename needs optionally to
> > downcase the name (when w32-downcase-file-names is set).
>
> Hmm.. but downcasing is an operation on chars, not on bytes, so it
> should be applied to decoded names, right?
That's not how the code was written. w32.c functions get the strings
that are already encoded.
> > The way dostounix_filename downcases file names depends on the current
> > locale, so it must get encoded file names.
>
> Are you saying that the "downcase" function is not Emacs's own but is
> a function provided by the OS, so we need to encode the name to pass it
> to that function?
That's how the code works, yes.
> If so, we need to immediately decode the result.

We already do. Example:

      else if (STRING_MULTIBYTE (filename))
        {
          tem_fn = ENCODE_FILE (make_specified_string (beg, -1, p - beg, 1));
          dostounix_filename (SSDATA (tem_fn));
          tem_fn = DECODE_FILE (tem_fn);
        }

> (and of course this encode+downcase+decode is only done if
> w32-downcase-file-names is set).
Can't do that, because dostounix_filename also mirrors the backslashes
and downcases the drive letter -- independently of
w32-downcase-file-names. Since dostounix_filename currently operates
only on encoded file names, the above is always done for decoded file
names.
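
As a rough illustration of the two unconditional steps just described (mirroring backslashes and downcasing the drive letter) applied to an encoded name, here is a sketch in C. It is not the actual dostounix_filename: `sketch_dostounix` and `sketch_lead_byte_p` are hypothetical names, and the cp932 lead-byte ranges are hard-coded as an assumption. The sketch skips the trail byte of each two-byte character so that a 0x5C trail byte is left alone.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: cp932 lead bytes are 0x81-0x9F and 0xE0-0xFC.  */
static int
sketch_lead_byte_p (unsigned char c)
{
  return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Hypothetical stand-in for the unconditional part of
   dostounix_filename: downcase an ASCII drive letter and turn each
   backslash separator into a forward slash, in place, on a
   NUL-terminated name encoded in a double-byte charset.  */
static void
sketch_dostounix (char *p)
{
  assert (p != NULL);

  /* Downcase a leading drive letter such as "C:".  */
  if (p[0] >= 'A' && p[0] <= 'Z' && p[1] == ':')
    p[0] += 'a' - 'A';

  while (*p)
    {
      if (sketch_lead_byte_p ((unsigned char) *p) && p[1] != '\0')
        p += 2;                 /* step over a whole two-byte character */
      else
        {
          if (*p == '\\')
            *p = '/';
          p++;
        }
    }
}
```

Applied to the bytes of "C:\表\x" in cp932, this yields "c:/" followed by 0x95 0x5C and then "/x": both real separators are mirrored, while the trail byte of "表" survives.
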
> Alternatively, we could use Emacs's own downcasing function, which does
> not depend on the locale and operates directly on decoded names.
That's what I intend to do, indeed, once the dust settles on this
discussion, and I understand the requirements.
Note that using Emacs's downcase is not a trivial change, because
(AFAIK) accessing the downcase_table can trigger GC. Also, downcasing
might change the byte count of a multibyte string (due to
unification), so we cannot pass a 'char *' to dostounix_filename. Not
rocket science, of course, but still...
Alternatively, we could downcase inline in the primitives themselves,
not inside dostounix_filename.
> But indeed for uses of IS_DIRECTORY_SEP in w32.c this is probably more
> serious since those functions emulate POSIX calls, so they always receive
> encoded file names.
I think I already fixed all of them.
> > UTF-8 precludes them. Thus my question whether we want to support
> > encoded file names in these primitives as first-class citizens.
>
> Could you specify a bit more precisely which primitives you have
> in mind?
Those in fileio.c and in dired.c. I could give an explicit list, if
you want.