From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Multibyte and unibyte file names
Date: Wed, 23 Jan 2013 19:45:35 +0200
Message-ID: <83ehhbn680.fsf@gnu.org>
To: emacs-devel@gnu.org
Cc: Kazuhiro Ito, Michael Albinus

For some initial context, see

  http://debbugs.gnu.org/cgi/bugreport.cgi?bug=13515#14

and my response there.  However, the issue at hand is IMO much
broader.

Let me start with a question: do file primitives need to support
unibyte file names as well as multibyte ones?  To avoid ambiguity, let
me say right away that by "unibyte" I mean here file names encoded in
some file-name-coding-system, possibly with non-ASCII characters.  I
do NOT mean pure-ASCII file names (which in Emacs are normally
represented as unibyte strings).

Looking at the code, it sounds like the answer to the above is YES.
For example, expand-file-name clearly tries to be careful to support
both, as seen from this snippet:

  multibyte = STRING_MULTIBYTE (name);
  if (multibyte != STRING_MULTIBYTE (default_directory))
    {
      if (multibyte)
        default_directory = string_to_multibyte (default_directory);
      else
        {
          name = string_to_multibyte (name);
          multibyte = 1;
        }
    }

Moreover, some other primitives clearly expect other primitives to
work on encoded file names.
Here's a fragment from file_name_completion:

  encoded_dir = ENCODE_FILE (dirname);

  block_input ();
  d = opendir (SSDATA (Fdirectory_file_name (encoded_dir)));

Assuming that encoded file names _should_ be supported, I think this
snippet, from directory_file_name, is a bug:

  if (srclen > 1
      && IS_DIRECTORY_SEP (dst[srclen - 1]))
    {
      dst[srclen - 1] = 0;
      srclen--;
    }

If dst[] is an encoded string that uses a multibyte encoding, it is
wrong to look at just the last byte of the string, because it could be
a trailing byte of some multibyte sequence, right?  There are a lot of
similar fragments in fileio.c, so much so that it seems as if there's
a hidden assumption that these strings cannot be encoded.  Which seems
to contradict the two fragments above, from expand-file-name and from
file_name_completion.  Am I missing something?

Why is this important?  For 2 main reasons:

  1) Many file primitives call dostounix_filename on MS-Windows.  That
     function converts backslashes to forward slashes and optionally
     down-cases the file name.  It is currently written to accept an
     encoded file name, and as long as file primitives need to support
     unibyte file names, dostounix_filename must DTRT with them.
     Encoding file names means in some situations that file names
     un-encodable in file-name-coding-system come out butchered from
     dostounix_filename, whereas some primitives are supposed to work
     on the file names on the syntactic level only, which is
     independent of whether or not a file can be passed to the
     underlying filesystem.  This also means that only cpNNNN
     encodings are fully supported on MS-Windows, because for other
     encodings Windows APIs don't have information which allows, e.g.,
     advancing by characters in an encoded file name, looking for
     slashes and backslashes, and down-casing characters.

  2) This gets worse with remote file names.  For these, the handlers
     are always called first, and the result is never run through
     dostounix_filename.
     However, Tramp sometimes turns around and calls the "real"
     handler on parts of the remote file name, evidently expecting
     that "real" handler not to do any harm.  But due to the above,
     it does do harm.  While it might be justified to limit native
     file name support to file names encodable with the current
     file-name-coding-system, it _cannot_ be justified for remote
     file names.

     An example of this is file-name-directory:

       (defun tramp-handle-file-name-directory (file)
         "Like `file-name-directory' but aware of Tramp files."
         ;; Everything except the last filename thing is the directory.  We
         ;; cannot apply `with-parsed-tramp-file-name', because this expands
         ;; the remote file name parts.  This is a problem when we are in
         ;; file name completion.
         (let ((v (tramp-dissect-file-name file t)))
           ;; Run the command on the localname portion only.
           (tramp-make-tramp-file-name
            (tramp-file-name-method v)
            (tramp-file-name-user v)
            (tramp-file-name-host v)
            (tramp-run-real-handler
             'file-name-directory (list (or (tramp-file-name-localname v) ""))))))

     which on Windows means that, e.g.

       (let ((file-name-coding-system 'cp1252))
         (file-name-directory "/eliz@fencepost.gnu.org:漢字/"))

         => "/eliz@fencepost.gnu.org: /"

     And there are other similar handlers in Tramp (e.g., the
     file-name-nondirectory handler) which do the same.  IOW, they
     seem to _assume_ that the corresponding "real" handler never
     needs to encode the file name.  A false assumption.

I don't know what to do with this mess.  If file primitives are not
supposed to handle encoded file names, dostounix_filename could be
rewritten to work on multibyte strings in Emacs's internal
representation, and then it wouldn't need to rely on Windows APIs that
require the encoding to be known to Windows and the characters in the
file name to be encodable in that encoding.
But that would need non-trivial changes elsewhere, and we need to
decide what to do if an encoded string does get passed to these
primitives (signal an error?).

Note that, as long as encoded multibyte strings can get into these
primitives, code that advances by bytes and examines individual bytes
for equality to certain values like '/' is buggy on Unix as well,
unless I'm missing something.

Comments are welcome, as well as pointers to what I missed.

TIA
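
To make the trailing-byte concern concrete, here is a minimal sketch in
Python (it is not Emacs code and only uses Python's own codecs).  It
assumes cp932 as the file-name encoding, where U+8868 is encoded as the
two bytes 0x95 0x5C, so the last byte of the name equals a backslash.
The function names are hypothetical; the first one merely mimics the
byte-wise check quoted above from directory_file_name.  The last lines
assume, purely for illustration, that an ISO-2022 style encoding could
be used for file names, in which case a '/' byte can appear inside a
non-ASCII character.

```python
# Illustration only: plain Python codecs, not Emacs internals.

SEPARATORS = (0x2F, 0x5C)  # '/' and '\', both directory separators on MS-Windows


def strip_trailing_sep_bytewise(encoded: bytes) -> bytes:
    """Mimic the byte-wise check: drop the last byte if it looks like a separator."""
    if len(encoded) > 1 and encoded[-1] in SEPARATORS:
        return encoded[:-1]
    return encoded


def strip_trailing_sep_decoded(encoded: bytes, coding: str) -> bytes:
    """Decode first, test the last *character*, then encode again."""
    text = encoded.decode(coding)
    if len(text) > 1 and text[-1] in "/\\":
        text = text[:-1]
    return text.encode(coding)


# U+8868 is 0x95 0x5C in cp932: the trailing byte has the value of '\'.
name = "dir/\u8868".encode("cp932")

broken = strip_trailing_sep_bytewise(name)          # cuts the character in half
intact = strip_trailing_sep_decoded(name, "cp932")  # leaves the name alone

print(name, broken, intact)

# Assumed scenario: in ISO-2022-JP, U+30AF contains the byte 0x2F ('/'),
# so a byte-wise scan for '/' would see a separator that is not there.
print(b"/" in "\u30af".encode("iso2022_jp"))
```

The character-aware variant pays for a decode and an encode on every
call, which is roughly the trade-off behind working on multibyte
strings in the internal representation instead of on encoded bytes.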