Re: Multibyte and unibyte file names

From: "Stephen J. Turnbull" <stephen@xemacs.org>
To: Eli Zaretskii <eliz@gnu.org>
Cc: emacs-devel@gnu.org
Subject: Re: Multibyte and unibyte file names
Date: Sun, 27 Jan 2013 02:10:54 +0900
Message-ID: <87mwvv3m5d.fsf@uwakimon.sk.tsukuba.ac.jp>
In-Reply-To: <83ham4jcbf.fsf@gnu.org>

I have to say I'm depressed: it is indeed sounding like a fair amount
of work, even without trying to get rid of the root cause.

Eli Zaretskii writes:

 > > My preferred flavor of Emacs never had unibyte.  It's got its problems
 > > in this area, but they're just lazy or over-ambitious programmer bugs,
 > > not a design flaw.
 > 
 > I can't reason about something I know nothing about.  So this is not a
 > useful argument.

Sure it is.  XEmacs is a pretty good facsimile of Emacs-compatibility;
the regular howls from people who want to support XEmacs when Emacs
does something to break compatibility are proof of that.  Nevertheless,
we've never needed unibyte, and our *-as-unibyte functions are no-ops,
and nobody has ever complained about that (a fact that remains
somewhat surprising to me).

 > > Of course.  In fact, pretty much all interaction with the outside
 > > world involves byte streams.  The problem Emacs is experiencing here
 > > is that Lisp can see bytes when it is designed only to work with
 > > characters.
 > 
 > In GNU Emacs, Lisp can work with bytes as well.

Not very well, historically (\207 bug, the expand-file-name bug Stefan
mentioned).  In terms of bug count that's nothing to be ashamed of:
dealing with the bytes/Unicode split has cost Python a huge amount of
effort, and many bugs.  But it was unnecessary in the first place in
Emacs.

 > That's OK.  Emacs cannot solve these situations, and I didn't try to
 > target them.  I will be happy enough to correctly support file names
 > consistently encoded in a single encoding that is the value of
 > file-name-coding-system.  I hope you will agree that having _that_
 > broken is not good.

It's horrible.  I'm just saying that it might very well be worth
biting the bullet and eliminating unibyte instead of trying to patch
up a fundamentally poor design.  Or at least bypass unibyte for these
functions.

 > If you look back at this thread, you will see that this is what I
 > tried to say, but was consistently told that Posix systems have no
 > such problems "in practice".

Your informants evidently don't live in Japan.  In practice it's only
a problem if you need to deal with Shift JIS (cp932), such as on a
thumb drive or SMB mount (ISTR that Samba somehow uses Unicode for
CIFS nowadays).  Nobody even thinks about using 7-bit JIS etc; POSIX
systems use either UTF-8 or EUC-JP (which you may recall is
ASCII-compatible, and uses only high-bit-set bytes for Japanese).  I
imagine there are similar issues for some subset of Chinese due to
Big5.
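
To make the Shift JIS vs. EUC-JP contrast concrete, here is a minimal
Python sketch (not from the thread).  It assumes Python's standard
cp932 and euc_jp codecs; the sample character and the helper name
ascii_range_bytes are chosen purely for illustration.  It checks
whether any byte of the encoded text falls in the 7-bit ASCII range.

```python
# Illustration only: SAMPLE and ascii_range_bytes are hypothetical names.
SAMPLE = "表"  # U+8868, a common kanji


def ascii_range_bytes(text: str, encoding: str) -> bytes:
    """Return the encoded bytes that fall in the 7-bit ASCII range."""
    return bytes(b for b in text.encode(encoding) if b < 0x80)


for coding in ("cp932", "euc_jp"):
    encoded = SAMPLE.encode(coding)
    # Show the raw bytes and which of them could be mistaken for ASCII.
    print(coding, encoded.hex(), ascii_range_bytes(SAMPLE, coding))
```

In cp932 the trail byte of this character is 0x5C, the ASCII backslash,
so code that scans the raw bytes for ASCII characters can be fooled.  In
EUC-JP every byte of the character has the high bit set.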

It *is* true that such issues are becoming rarer (but Shift JIS
incompatibility is a monthly annoyance for me because of a broken FTP
server I have to deal with).

 > Decoding is not a problem, but it hampers efficiency.

I'm sorry, but that's, uh, "premature optimization".  If Emacs were a
p-language, you'd have a wooden leg to stand on.[1]  But it's not.
People do not write byte-shoveling applications in Emacs Lisp.  They
do write text-shoveling applications, but to be correct those require
atomic characters, so you need to convert anyway.

 > There's also an associated problem that decoding a file can GC,
 > which is not good for functions that get 'char *' pointers as
 > arguments.

So never give them a char* into a Lisp_String, or inhibit GC when you
do.  But strncpy is plenty fast for this application[2], one hell of a
lot faster than the system calls you make to access a filesystem.
Even strndup is fast enough in our experience.

 > > In fact AFAIK the set of programs that use the unibyte feature at
 > > all is pretty small, and most of those (like Tramp) do so only in
 > > self-defense.
 > 
 > You are thinking on the wrong level.  The problem rears its ugly head
 > on the C level, not on the Lisp level.  Functions in dired.c and
 > fileio.c manipulate file names, assuming it is safe to address
 > individual bytes even if the file name is in some DBCS encoding.

And that's not mediated by Lisp?  I would be surprised if you find any
code paths involving dired that grab a filename from the system, pass
it to a manipulation function, and then try to access the file without
ever storing it in a Lisp object.[3][4]

Footnotes:
[1]  There's plenty of evidence that converting unibyte strings to
Unicode (widechar) in Python 3 doesn't hurt anything but the feelings
of people who assume it's costly but don't benchmark.

[2]  You know that the buffer size is at most PATH_MAX + 1.

[3]  Except for very early in initialization of the interpreter, when
Emacs is still finding pieces of itself.

[4]  Indeed those were among the earliest files to be fully Mule-ized
in XEmacs, which in XEmacs means that textual data received from
outside of XEmacs is immediately converted to internal representation,
and only converted back to external representation immediately before
the system library call or kernel call that consumes it.
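
The convert-at-the-boundary model in footnote [4] can be sketched in
Python as follows.  This is an illustration under stated assumptions,
not XEmacs code: FILE_NAME_CODING stands in for
file-name-coding-system, and from_external, to_external and
replace_extension are hypothetical helpers.

```python
# Sketch of "decode on input, encode just before the system call".
# All names here are hypothetical stand-ins, not Emacs or XEmacs APIs.
FILE_NAME_CODING = "cp932"  # assumed value of the file-name coding system


def from_external(raw: bytes) -> str:
    """Decode a file name as soon as it arrives from outside."""
    return raw.decode(FILE_NAME_CODING)


def to_external(name: str) -> bytes:
    """Encode a file name immediately before handing it to the system."""
    return name.encode(FILE_NAME_CODING)


def replace_extension(name: str, new_ext: str) -> str:
    """Manipulate the name as characters, never as raw bytes."""
    stem, dot, _ = name.rpartition(".")
    return (stem if dot else name) + new_ext


raw_in = b"\x95\x5c.txt"          # cp932 bytes; the second byte is 0x5C
name = from_external(raw_in)      # internal representation: characters
raw_out = to_external(replace_extension(name, ".bak"))

# Byte-level splitting on backslash cuts the first character in half.
broken = raw_in.split(b"\\")
print(name, raw_out, broken)
```

Working on the decoded string keeps the two-byte character atomic,
whereas splitting the raw bytes on 0x5C separates its lead byte from
the rest of the name.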
