From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Stephen J. Turnbull"
Newsgroups: gmane.emacs.devel
Subject: Re: Multibyte and unibyte file names
Date: Sun, 27 Jan 2013 02:10:54 +0900
Message-ID: <87mwvv3m5d.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <83ehhbn680.fsf@gnu.org> <83wqv2ldk1.fsf@gnu.org> <83obgel94c.fsf@gnu.org> <83k3r1lnlb.fsf@gnu.org> <83vcalj97s.fsf@gnu.org> <87vcak3ar1.fsf@uwakimon.sk.tsukuba.ac.jp> <83mwvwjib3.fsf@gnu.org> <87sj5o2j1b.fsf@uwakimon.sk.tsukuba.ac.jp> <83ham4jcbf.fsf@gnu.org>
In-Reply-To: <83ham4jcbf.fsf@gnu.org>
To: Eli Zaretskii
Cc: emacs-devel@gnu.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8

I have to say I'm depressed: it is indeed sounding like a fair amount
of work, even without trying to get rid of the root cause.

Eli Zaretskii writes:

> > My preferred flavor of Emacs never had unibyte.  It's got its problems
> > in this area, but they're just lazy or over-ambitious programmer bugs,
> > not a design flaw.
>
> I can't reason about something I know nothing about.  So this is not a
> useful argument.

Sure it is.  XEmacs is a pretty good facsimile of Emacs-compatibility;
the regular howls from people who want to support XEmacs when Emacs
does something to break compatibility are proof of that.  Nevertheless,
we've never needed unibyte, and our *-as-unibyte functions are no-ops,
and nobody has ever complained about that (a fact that remains
somewhat surprising to me).

> > Of course.  In fact, pretty much all interaction with the outside
> > world involves byte streams.  The problem Emacs is experiencing here
> > is that Lisp can see bytes when it is designed only to work with
> > characters.
>
> In GNU Emacs, Lisp can work with bytes as well.

Not very well, historically (\207 bug, the expand-file-name bug Stefan
mentioned).  Nothing to be ashamed of at the counting bugs level:
dealing with the bytes/unicode split has cost Python a huge amount of
effort, and many bugs.  But it was unnecessary in the first place in
Emacs.

> That's OK.  Emacs cannot solve these situations, and I didn't try to
> target them.  I will be happy enough to correctly support file names
> consistently encoded in a single encoding that is the value of
> file-name-coding-system.  I hope you will agree that having _that_
> broken is not good.

It's horrible.  I'm just saying that it might very well be worth
biting the bullet and eliminating unibyte instead of trying to patch
up a fundamentally poor design.  Or at least bypass unibyte for these
functions.

> If you look back at this thread, you will see that this is what I
> tried to say, but was consistently told that Posix systems have no
> such problems "in practice".

Your informants evidently don't live in Japan.  In practice it's only
a problem if you need to deal with Shift JIS (cp932), such as on a
thumb drive or SMB mount (ISTR for CIFS Samba uses Unicode somehow
nowadays).  Nobody even thinks about using 7-bit JIS etc; POSIX
systems use either UTF-8 or EUC-JP (which you may recall is
ASCII-compatible, and uses only high-bit-set bytes for Japanese).  I
imagine there are similar issues for some subset of Chinese due to
Big5.

It *is* true that such issues are becoming rarer (but Shift JIS
incompatibility is a monthly annoyance for me because of a broken FTP
server I have to deal with).

> Decoding is not a problem, but it hampers efficiency.

I'm sorry, but that's, uh, "premature optimization".  If Emacs were a
p-language, you'd have a wooden leg to stand on.[1]  But it's not.
People do not write byte-shoveling applications in Emacs Lisp.  They
do write text-shoveling applications, but to be correct those require
atomic characters, so you need to convert anyway.

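To make the Shift JIS versus EUC-JP point above concrete, here is a
small Python sketch (Python only because its codecs are handy; the
helper name ascii_range_bytes is made up for illustration).  It
encodes one kanji both ways and reports which encoded bytes land in
the ASCII range.  In EUC-JP none do; in cp932 the trailing byte is
0x5C, which byte-oriented code would take for a backslash.

```python
def ascii_range_bytes(text, encoding):
    """Return the encoded bytes of TEXT that fall in the ASCII range."""
    return bytes(b for b in text.encode(encoding) if b < 0x80)


name = "表"  # U+8868; its cp932 encoding is the two bytes 0x95 0x5C

# EUC-JP: both bytes of the character have the high bit set.
euc_ascii = ascii_range_bytes(name, "euc_jp")

# Shift JIS (cp932): the trail byte collides with ASCII backslash.
sjis_ascii = ascii_range_bytes(name, "cp932")

# Byte-wise processing that treats 0x5C as special cuts the
# character in half; the same operation on decoded text does not.
split_bytes = (name + ".txt").encode("cp932").split(b"\\")
split_chars = (name + ".txt").split("\\")

print(euc_ascii, sjis_ascii, split_bytes, split_chars)
```

This only shows the backslash collision; it is a sketch of the class
of problem, not a reproduction of any particular Emacs bug.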
> There's also an associated problem that decoding a file can GC,
> which is not good for functions that get 'char *' pointers as
> arguments.

So never give them a char* into a Lisp_String, or inhibit GC when you
do.  But strncpy is plenty fast for this application[2], one hell of a
lot faster than the system calls you make to access a filesystem.
Even strndup is fast enough in our experience.

> > In fact AFAIK the set of programs that use the unibyte feature at
> > all is pretty small, and most of those (like Tramp) do so only in
> > self-defense.
>
> You are thinking on the wrong level.  The problem rears its ugly head
> on the C level, not on the Lisp level.  Functions in dired.c and
> fileio.c manipulate file names, assuming it is safe to address
> individual bytes even if the file name is in some DBCS encoding.

And that's not mediated by Lisp?  I would be surprised if you find any
code paths involving dired that grab a filename from the system, pass
it to a manipulation function, and then try to access the file without
ever storing it in a Lisp object.[3][4]


Footnotes:
[1]  There's plenty of evidence that converting unibyte strings to
Unicode (widechar) in Python 3 doesn't hurt anything but the feelings
of people who assume it's costly but don't benchmark.

[2]  You know that the buffer size is at most PATH_MAX + 1.

[3]  Except for very early in initialization of the interpreter, when
Emacs is still finding pieces of itself.

[4]  Indeed those were among the earliest files to be fully Mule-ized
in XEmacs, which in XEmacs means that textual data received from
outside of XEmacs is immediately converted to internal representation,
and only converted back to external representation immediately before
the system library call or kernel call that consumes it.
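
A rough illustration of the convert-at-the-boundary discipline
described in footnote [4], written in Python rather than C or Lisp.
Every name here (FILE_NAME_CODING_SYSTEM, decode_external,
encode_external, add_suffix) is hypothetical and is not XEmacs or
Emacs API; the raw bytes stand in for what a directory read might
return on a Shift JIS filesystem.

```python
import os

# Hypothetical stand-in for the value of file-name-coding-system.
FILE_NAME_CODING_SYSTEM = "cp932"


def decode_external(raw):
    """Convert an external file name to text as soon as it arrives."""
    return raw.decode(FILE_NAME_CODING_SYSTEM)


def encode_external(name):
    """Convert text back to bytes only right before the system call."""
    return name.encode(FILE_NAME_CODING_SYSTEM)


def add_suffix(name, suffix):
    """Manipulate the file name as characters, never as raw bytes."""
    base, ext = os.path.splitext(name)
    return base + suffix + ext


# "表.txt" in cp932; the second byte of the kanji is 0x5C.
raw_from_system = b"\x95\\.txt"

name = decode_external(raw_from_system)     # boundary: bytes -> text
renamed = add_suffix(name, "-backup")       # work on characters only
raw_for_syscall = encode_external(renamed)  # boundary: text -> bytes

print(name, renamed, raw_for_syscall)
```

Nothing between the two boundary calls ever sees a byte, so no
manipulation function can split a double-byte character.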