From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: Multibyte and unibyte file names
Date: Wed, 23 Jan 2013 21:04:46 +0200
Message-ID: <83bocfn2k1.fsf@gnu.org>
References: <83ehhbn680.fsf@gnu.org> <51002719.3080805@cs.ucla.edu>
To: Paul Eggert
Cc: kzhr@d1.dion.ne.jp, michael.albinus@gmx.de, emacs-devel@gnu.org

> Date: Wed, 23 Jan 2013 10:08:25 -0800
> From: Paul Eggert
> CC: emacs-devel@gnu.org, Kazuhiro Ito,
>  Michael Albinus
>
> On 01/23/13 09:45, Eli Zaretskii wrote:
>
> >       if (srclen > 1
> >           && IS_DIRECTORY_SEP (dst[srclen - 1]))
> >         {
> >           dst[srclen - 1] = 0;
> >           srclen--;
> >         }
> >
> > If dst[] is an encoded string that uses a multibyte encoding, it is
> > wrong to look at just the last byte of the string, because it could be
> > a trailing byte of some multibyte sequence, right?
>
> If memory serves, the answer to that question is different for
> GNU / POSIX / etc (GNUish) systems than for MS-Windows systems.
> On GNUish systems, the kernel doesn't know about encodings,
> so the above code is correct for the file system even if
> it produces a byte string that is not properly encoded for
> the file name coding system.

I understand that, but what it means is that encoding a file name,
then removing its last "slash" as above, then decoding it again will
yield a wrong or even an invalid string, right?  IOW, Emacs will still
have a bug, even though from the OS point of view that slash would
have been regarded as a directory separator.
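
To make the trailing-byte problem concrete, here is a minimal sketch in C
of a separator check that is aware of two-byte sequences.  It is not the
Emacs code: is_dir_sep is a hypothetical stand-in for IS_DIRECTORY_SEP on
MS-Windows, and sjis_lead_byte hard-codes the Shift-JIS (cp932) lead-byte
ranges as an assumption; a real fix would presumably ask the system about
the active codepage instead.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for IS_DIRECTORY_SEP on MS-Windows. */
static bool is_dir_sep(unsigned char c)
{
    return c == '/' || c == '\\';
}

/* Assumption: the encoding is Shift-JIS (cp932), whose lead bytes
   fall in 0x81..0x9F and 0xE0..0xFC. */
static bool sjis_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Walk the string from the start, one character at a time, and report
   whether the final byte is a character of its own rather than the
   trail byte of a two-byte sequence. */
static bool last_byte_is_whole_char(const unsigned char *s, size_t len)
{
    size_t i = 0, last = 0;

    while (i < len) {
        last = i;
        i += (sjis_lead_byte(s[i]) && i + 1 < len) ? 2 : 1;
    }
    return len > 0 && last == len - 1;
}

/* Same shape as the quoted fragment, but the separator is removed only
   when it is a real character.  Returns the new length. */
static size_t strip_trailing_sep(unsigned char *dst, size_t srclen)
{
    if (srclen > 1
        && is_dir_sep(dst[srclen - 1])
        && last_byte_is_whole_char(dst, srclen)) {
        dst[srclen - 1] = 0;
        srclen--;
    }
    return srclen;
}
```

For example, the Shift-JIS encoding of U+8868 is the byte pair 0x95 0x5C,
and 0x5C is also the backslash.  The byte-wise check in the quoted
fragment would chop that character in half; the sketch above leaves it
alone, while still stripping a genuine trailing backslash.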

> On MS-Windows systems, as I understand it, the operating system is
> cognizant of which file name encoding you're using, so the above is
> indeed an error.

The OS uses UTF-16 for file names, but APIs Emacs uses accept
single-byte or DBCS encoded file names, which are converted to UTF-16
internally, before handing them to the filesystem layer.  It is this
conversion that must support the original encoding, or else the UTF-16
result will be incorrect, or in extreme cases the API itself will fail
and reject the file name.

> In practice nobody in the GNUish world uses encodings that
> are unsafe for '/', so to some extent this is just a theoretical
> issue in the GNUish world -- it just doesn't come up.

Yes, that part is quite clear.  Likewise, since UTF-8 is almost always
the file-name encoding, bugs whereby un-encoded file names are passed
to system APIs can easily go unnoticed.
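
The point about encodings that are "unsafe for '/'" can be illustrated
with a small sketch in C.  It counts the bytes of an encoded name that
look like a directory separator, for the same character (U+8868) in
UTF-8 and in Shift-JIS.  The function name count_sep_bytes and the two
byte arrays are made up for this illustration.

```c
#include <stddef.h>

/* U+8868 encoded as UTF-8: every byte of a multibyte sequence has its
   high bit set, so none of them can be mistaken for '/' or '\\'. */
static const unsigned char HYO_UTF8[] = {0xE8, 0xA1, 0xA8};

/* The same character in Shift-JIS (cp932): the trail byte is 0x5C,
   which is also the ASCII backslash. */
static const unsigned char HYO_CP932[] = {0x95, 0x5C};

/* Count the bytes that a byte-wise scan would treat as a directory
   separator, without any knowledge of the encoding. */
static size_t count_sep_bytes(const unsigned char *s, size_t len)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++)
        if (s[i] == '/' || s[i] == '\\')
            n++;
    return n;
}
```

The scan finds no separator-like byte in HYO_UTF8 and one in HYO_CP932,
even though neither name contains a separator.  This is why byte-wise
code goes wrong only with DBCS encodings, and why the same code looks
correct wherever UTF-8 is the file-name encoding.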