From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: Multibyte and unibyte file names
Date: Sun, 27 Jan 2013 09:03:08 +0200
Message-ID: <834ni3jefn.fsf@gnu.org>
To: Stefan Monnier
Cc: kzhr@d1.dion.ne.jp, michael.albinus@gmx.de, emacs-devel@gnu.org

> From: Stefan Monnier
> Cc: emacs-devel@gnu.org, kzhr@d1.dion.ne.jp, michael.albinus@gmx.de
> Date: Sat, 26 Jan 2013 17:11:25 -0500
>
> > OK, but as long as file-name primitives are required to support
> > unibyte strings, you cannot be sure these situations won't pop up in
> > the future.
>
> I don't see a need to disallow unibyte strings, but I don't see the need
> to be particularly careful about it either.  Basically Elisp code which
> provides unibyte file names does it at its own risks.

What about C code that calls these primitives?  Can we consider every
such instance a bug in the caller?  If so, we could stop catering to
unibyte strings in these primitives, which will make at least some of
them a whole lot simpler.

> >> I think the right thing to do with unibyte file names is to treat them
> >> as a sequence of bytes, not a sequence of encoded chars.  If the caller
> >> doesn't like it, then she should pass a decoded file name instead.
> > This effectively means we don't support them _as_file_names_.
> > Because, e.g., testing individual bytes for equality to something like
> > '\\' can trip on multibyte (DBCS) encodings if the trailing byte
> > happens to be '\\'.  In general, it isn't "safe" to iterate over these
> > strings one byte at a time.
>
> But that's exactly the behavior stipulated by POSIX (tho for '/' rather
> than '\\').  I.e. if you use file names on a POSIX host with
> a coding-system that occasionally uses '/' within its multibyte
> sequences, you'll get those surprises regardless of Emacs.  And for that
> reason, Emacs would be right to cut those file names in the middle of
> a multibyte sequence.

Then why did you regard this:

  (let ((file-name-coding-system 'cp932))
    (expand-file-name "表" "C:/"))

  => "c:/\225/"

as a bug?  This is exactly what happens there: the string "表", when
encoded with cp932, has '\' as its last byte.

> IIUC that's what makes this a "w32-only problem", because the w32
> semantics for file names is based on characters, so a '\\' (or a '/')
> appearing with a multibyte sequence is not considered by the OS as
> a separator.
>
> And since Emacs is largely based on "POSIX semantics for the generic
> code, plus an emulation layer in w32.c", we have a problem of subtly
> incompatible semantics.

Maybe so, but it certainly isn't the only place in Emacs with subtly
incompatible semantics.  And anyway, I don't see how this observation
helps to decide what, if anything, to do to fix this.

> >> > Are you saying that since this happens
> >> > infrequently, we could process such file names in a broken way,
> >> Right.
> > He, I don't think this will be well accepted.
>
> I haven't heard too many screams about this over the years.

I heard 2 this week, from 2 different users.  Inability to reference
file names that are allowed by the underlying filesystem is a bad bug,
IMO.

> > And it does that because dostounix_filename needs optionally to
> > downcase the name (when w32-downcase-file-names is set).
>
> Hmm.. but downcasing is an operation on chars, not on bytes, so it
> should be applied to decoded names, right?

That's not how the code was written.  w32.c functions get the strings
that are already encoded.

> > The way dostounix_filename downcases file names depends on the current
> > locale, so it must get encoded file names.
>
> Are you saying that the "downcase" function is not Emacs's own but is
> a function provided by the OS, so we need to encode the name to pass it
> to that function?

That's how the code works, yes.

> If so, we need to immediately decode the result.

We already do.  Example:

      else if (STRING_MULTIBYTE (filename))
        {
          tem_fn = ENCODE_FILE (make_specified_string (beg, -1, p - beg, 1));
          dostounix_filename (SSDATA (tem_fn));
          tem_fn = DECODE_FILE (tem_fn);
        }

> (and of course this encode+downcase+decode is only done if
> w32-downcase-file-names is set).

Can't do that, because dostounix_filename also mirrors the backslashes
and downcases the drive letter -- independently of
w32-downcase-file-names.  Since dostounix_filename currently operates
only on encoded file names, the above is always done for decoded file
names.

> Alternatively, we could use Emacs's own downcasing function, which does
> not depend on the locale and operates directly on decoded names.

That's what I intend to do, indeed, once the dust settles on this
discussion, and I understand the requirements.

Note that using Emacs's downcase is not a trivial change, because
(AFAIK) accessing the downcase_table can trigger GC.  Also, downcasing
might change the byte count of a multibyte string (due to
unification), so we cannot pass a 'char *' to dostounix_filename.  Not
rocket science, of course, but still...

Alternatively, we could downcase inline in the primitives themselves,
not inside dostounix_filename.

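To make the trailing-byte problem with mirroring backslashes in encoded
names concrete, here is a minimal sketch.  It is not the actual
dostounix_filename code: the sketch_* names are hypothetical, and the
cp932 lead-byte ranges (0x81-0x9F and 0xE0-0xFC) are hard-coded as an
assumption, where real w32 code would more likely ask the OS, e.g. via
IsDBCSLeadByteEx.

```c
/* Hypothetical helper, for illustration only.
   Assumption: cp932 lead bytes are 0x81-0x9F and 0xE0-0xFC.  */
static int
sketch_cp932_lead_byte_p (unsigned char c)
{
  return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Byte-wise mirroring: every 0x5C byte becomes '/', including one
   that is really the trail byte of a 2-byte character.  */
static void
sketch_mirror_bytewise (char *p)
{
  for (; *p; p++)
    if (*p == '\\')
      *p = '/';
}

/* Lead-byte-aware mirroring: when a lead byte is seen, step over it
   and its trail byte together, so the trail byte is never tested.  */
static void
sketch_mirror_dbcs (char *p)
{
  while (*p)
    {
      if (sketch_cp932_lead_byte_p ((unsigned char) *p) && p[1])
        p += 2;
      else
        {
          if (*p == '\\')
            *p = '/';
          p++;
        }
    }
}
```

Given the cp932 bytes 0x95 0x5C for "表", the byte-wise version rewrites
the trail byte 0x5C to '/', which matches the "c:/\225/" result quoted
earlier in this message; the lead-byte-aware version leaves that byte
alone and only converts the real separators.
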
> But indeed for uses of IS_DIRECTORY_SEP in w32.c this is probably more
> serious since those functions emulate POSIX calls, so they always receive
> encoded file names.

I think I already fixed all of them.

> > UTF-8 precludes them.  Thus my question whether we want to support
> > encoded file names in these primitives as first-class citizens.
>
> Could you specify a bit more precisely which primitives you have
> in mind?

Those in fileio.c and in dired.c.  I could give an explicit list, if
you want.