Why does dired go through extra efforts to avoid unibyte names

unofficial mirror of emacs-devel@gnu.org 

* Why does dired go through extra efforts to avoid unibyte names
@ 2017-12-29 14:34 Stefan Monnier
  2017-12-29 19:17 ` Eli Zaretskii
  0 siblings, 1 reply; 8+ messages in thread
From: Stefan Monnier @ 2017-12-29 14:34 UTC
  To: Kenichi Handa; +Cc: emacs-devel

I bumped into the following code in dired-get-filename:

	  ;; The above `read' will return a unibyte string if FILE
	  ;; contains eight-bit-control/graphic characters.
	  (if (and enable-multibyte-characters
		   (not (multibyte-string-p file)))
	      (setq file (string-to-multibyte file)))

and I'm wondering why we don't want a unibyte string here.
`vc-region-history` told me this comes from the commit appended below,
which seems to indicate that we're worried about a subsequent encoding,
but AFAIK unibyte file names are not (re)encoded, and passing them
through string-to-multibyte would actually make things worse in this
respect (since it might cause the kind of (re)encoding this is
supposedly trying to avoid).
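
For illustration, the behaviour described by the quoted comment can be reproduced outside dired. This is only a minimal sketch, not the dired code itself: the octal-escaped name and the local variable `file` are made up for the example, standing in for what `ls -b` might print.

```elisp
;; Hypothetical file name as `ls -b' might print it: two octal-escaped
;; non-ASCII bytes followed by ASCII text.
(let ((file (read (concat "\"" "\\303\\251.txt" "\""))))
  ;; `read' sees only ASCII and octal escapes below 256, so it
  ;; returns a unibyte string.
  (list (multibyte-string-p file)          ; is the result multibyte?
        (length file)                      ; one element per byte
        ;; What the code in question does next: every byte >= #x80
        ;; becomes an eight-bit (raw-byte) character.
        (multibyte-string-p (string-to-multibyte file))
        (length (string-to-multibyte file))))
;; => (nil 6 t 6)
```

The conversion changes the representation only; the number of characters stays the same.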

What am I missing?


        Stefan


commit 038b550196d92b9844a4efecf1c2ded0f920e957
Author: Kenichi Handa <handa@m17n.org>
Date:   Wed Mar 19 11:58:25 2003 +0000

    * dired.el (dired-get-filename): Pay attention to the case that
    `read' returns a unibyte string.  Don't encode the file name by
    buffer-file-coding-system.

diff --git a/lisp/dired.el b/lisp/dired.el
--- a/lisp/dired.el
+++ b/lisp/dired.el
@@ -1455,11 +1455,16 @@
 	  ;; Using read to unquote is much faster than substituting
 	  ;; \007 (4 chars) -> ^G  (1 char) etc. in a lisp loop.
 	  (setq file
 		(read
 		 (concat "\""
 			 ;; Some ls -b don't escape quotes, argh!
 			 ;; This is not needed for GNU ls, though.
 			 (or (dired-string-replace-match
 			      "\\([^\\]\\|\\`\\)\"" file "\\1\\\\\"" nil t)
 			     file)
-			 "\"")))))
+			 "\"")))
+	  ;; The above `read' will return a unibyte string if FILE
+	  ;; contains eight-bit-control/graphic characters.
+	  (if (and enable-multibyte-characters
+		   (not (multibyte-string-p file)))
+	      (setq file (string-to-multibyte file)))))





* Re: Why does dired go through extra efforts to avoid unibyte names
  2017-12-29 14:34 Why does dired go through extra efforts to avoid unibyte names Stefan Monnier
@ 2017-12-29 19:17 ` Eli Zaretskii
  2018-01-03  4:14   ` Stefan Monnier
  0 siblings, 1 reply; 8+ messages in thread
From: Eli Zaretskii @ 2017-12-29 19:17 UTC
  To: Stefan Monnier; +Cc: Kenichi Handa, emacs-devel, handa

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Fri, 29 Dec 2017 09:34:53 -0500
> Cc: emacs-devel@gnu.org
> 
> I bumped into the following code in dired-get-filename:
> 
> 	  ;; The above `read' will return a unibyte string if FILE
> 	  ;; contains eight-bit-control/graphic characters.
> 	  (if (and enable-multibyte-characters
> 		   (not (multibyte-string-p file)))
> 	      (setq file (string-to-multibyte file)))
> 
> and I'm wondering why we don't want a unibyte string here.
> `vc-region-history` told me this comes from the commit appended below,
> which seems to indicate that we're worried about a subsequent encoding,
> but AFAIK unibyte file names are not (re)encoded, and passing them
> through string-to-multibyte would actually make things worse in this
> respect (since it might cause the kind of (re)encoding this is
> supposedly trying to avoid).
> 
> What am I missing?

Why does it matter whether eight-bit-* characters are encoded one more
or one less time?

As for the reason for using string-to-multibyte: maybe it's because we
use concat further down in the function, which will determine whether
the result will be unibyte or multibyte according to its own ideas of
what's TRT?




* Re: Why does dired go through extra efforts to avoid unibyte names
  2017-12-29 19:17 ` Eli Zaretskii
@ 2018-01-03  4:14   ` Stefan Monnier
  2018-01-03 15:10     ` Eli Zaretskii
  0 siblings, 1 reply; 8+ messages in thread
From: Stefan Monnier @ 2018-01-03  4:14 UTC
  To: emacs-devel

>> I bumped into the following code in dired-get-filename:
>> 
>> 	  ;; The above `read' will return a unibyte string if FILE
>> 	  ;; contains eight-bit-control/graphic characters.
>> 	  (if (and enable-multibyte-characters
>> 		   (not (multibyte-string-p file)))
>> 	      (setq file (string-to-multibyte file)))
>> 
>> and I'm wondering why we don't want a unibyte string here.
>> `vc-region-history` told me this comes from the commit appended below,
>> which seems to indicate that we're worried about a subsequent encoding,
>> but AFAIK unibyte file names are not (re)encoded, and passing them
>> through string-to-multibyte would actually make things worse in this
>> respect (since it might cause the kind of (re)encoding this is
>> supposedly trying to avoid).
>> 
>> What am I missing?
>
> Why does it matter whether eight-bit-* characters are encoded one more
> or one less time?

That's part of the question, indeed.

> As for the reason for using string-to-multibyte: maybe it's because we
> use concat further down in the function, which will determine whether
> the result will be unibyte or multibyte according to its own ideas of
> what's TRT?

But `concat` will do a string-to-multibyte for us, if needed, so
that doesn't seem like a good reason.

This said, when that code was written, maybe `concat` used
string-make-multibyte internally instead, so this call to
string-to-multibyte might have been added to avoid using
string-make-multibyte inside `concat`?

It would be good to have a concrete case that needed the above code, to
see if the problem still exists.


        Stefan





* Re: Why does dired go through extra efforts to avoid unibyte names
  2018-01-03  4:14   ` Stefan Monnier
@ 2018-01-03 15:10     ` Eli Zaretskii
  2018-01-03 20:09       ` Stefan Monnier
  0 siblings, 1 reply; 8+ messages in thread
From: Eli Zaretskii @ 2018-01-03 15:10 UTC
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Tue, 02 Jan 2018 23:14:20 -0500
> 
> >> I bumped into the following code in dired-get-filename:
> >> 
> >> 	  ;; The above `read' will return a unibyte string if FILE
> >> 	  ;; contains eight-bit-control/graphic characters.
> >> 	  (if (and enable-multibyte-characters
> >> 		   (not (multibyte-string-p file)))
> >> 	      (setq file (string-to-multibyte file)))
> >> 
> >> and I'm wondering why we don't want a unibyte string here.
> >> `vc-region-history` told me this comes from the commit appended below,
> >> which seems to indicate that we're worried about a subsequent encoding,
> >> but AFAIK unibyte file names are not (re)encoded, and passing them
> >> through string-to-multibyte would actually make things worse in this
> >> respect (since it might cause the kind of (re)encoding this is
> >> supposedly trying to avoid).
> >> 
> >> What am I missing?
> >
> > Why does it matter whether eight-bit-* characters are encoded one more
> > or one less time?
> 
> That's part of the question, indeed.

The question was meant to be rhetorical ;-)  Eight-bit-* characters
are not in general modified by encoding them, so you could encode them
any number of times and still get the same bytes as result.
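
As a rough check of this claim, one can encode a raw byte repeatedly and compare the results. This is a sketch under assumptions: the byte `\377` and the choice of `utf-8` are arbitrary picks for the example, not something taken from dired.

```elisp
;; A multibyte string holding a single eight-bit (raw-byte) character.
(let* ((raw (string-to-multibyte "\377"))
       ;; Encoding turns the raw-byte character back into the byte itself.
       (once (encode-coding-string raw 'utf-8))
       ;; Encoding the already-encoded (unibyte) result again.
       (twice (encode-coding-string once 'utf-8)))
  (list (multibyte-string-p raw)
        (equal once "\377")
        (equal twice "\377")))
;; => (t t t)
```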

> > As for the reason for using string-to-multibyte: maybe it's because we
> > use concat further down in the function, which will determine whether
> > the result will be unibyte or multibyte according to its own ideas of
> > what's TRT?
> 
> But `concat` will do a string-to-multibyte for us, if needed

Not if the other concatenated parts are ASCII (which tend to be
unibyte strings).
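
A small sketch of how `concat` picks the representation of its result, for reference. The names `name`, `dir` and `mdir` are hypothetical and exist only for this example.

```elisp
(let ((name "\377")                  ; unibyte, a single raw byte
      (dir "/tmp/")                  ; ASCII literal, hence unibyte
      (mdir (string ?/ #xe9 ?/)))    ; multibyte: contains U+00E9
  (list
   ;; All arguments unibyte: the result stays unibyte.
   (multibyte-string-p (concat dir name))
   ;; Converting the name first forces a multibyte result.
   (multibyte-string-p (concat dir (string-to-multibyte name)))
   ;; One multibyte argument is enough to make the result multibyte.
   (multibyte-string-p (concat mdir name))))
;; => (nil t t)
```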

> This said, when that code was written, maybe `concat` used
> string-make-multibyte internally instead, so this call to
> string-to-multibyte might have been added to avoid using
> string-make-multibyte inside `concat`?

Could be.  I tried to look for relevant discussions around the time of
the commit, but couldn't find anything that would explain the reason.

> It would be good to have a concrete case that needed the above code, to
> see if the problem still exists.

Yep.




* Re: Why does dired go through extra efforts to avoid unibyte names
  2018-01-03 15:10     ` Eli Zaretskii
@ 2018-01-03 20:09       ` Stefan Monnier
  2018-01-05  9:10         ` Eli Zaretskii
  0 siblings, 1 reply; 8+ messages in thread
From: Stefan Monnier @ 2018-01-03 20:09 UTC
  To: emacs-devel

> Eight-bit-* characters are not in general modified by encoding them,
> so you could encode them any number of times and still get the same
> bytes as result.

Agreed.  But even if it were not the case, I don't see why that would
explain the presence of this code.

>> > As for the reason for using string-to-multibyte: maybe it's because we
>> > use concat further down in the function, which will determine whether
>> > the result will be unibyte or multibyte according to its own ideas of
>> > what's TRT?
>> But `concat` will do a string-to-multibyte for us, if needed
> Not if the other concatenated parts are ASCII (which tend to be
> unibyte strings).

But that's still perfectly fine as well since it will then result in
a unibyte string which will get "encoded" correctly.


        Stefan





* Re: Why does dired go through extra efforts to avoid unibyte names
  2018-01-03 20:09       ` Stefan Monnier
@ 2018-01-05  9:10         ` Eli Zaretskii
  2018-01-05 16:12           ` Stefan Monnier
  0 siblings, 1 reply; 8+ messages in thread
From: Eli Zaretskii @ 2018-01-05  9:10 UTC
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Wed, 03 Jan 2018 15:09:06 -0500
> 
> > Eight-bit-* characters are not in general modified by encoding them,
> > so you could encode them any number of times and still get the same
> > bytes as result.
> 
> Agreed.  But even if it were not the case, I don't see why that would
> explain the presence of this code.

I meant to ask why do _you_ worry about eight-bit-* characters being
encoded more than once?

> >> > As for the reason for using string-to-multibyte: maybe it's because we
> >> > use concat further down in the function, which will determine whether
> >> > the result will be unibyte or multibyte according to its own ideas of
> >> > what's TRT?
> >> But `concat` will do a string-to-multibyte for us, if needed
> > Not if the other concatenated parts are ASCII (which tend to be
> > unibyte strings).
> 
> But that's still perfectly fine as well since it will then result in
> a unibyte string which will get "encoded" correctly.

Where do you see encoding in this picture?

I think the issue is that we want dired-get-filename to always return
a multibyte string, so that its callers don't need to deal with the
complications, like inserting unibyte strings into multibyte buffers,
concatenating them with leading directories to form other file names,
etc.




* Re: Why does dired go through extra efforts to avoid unibyte names
  2018-01-05  9:10         ` Eli Zaretskii
@ 2018-01-05 16:12           ` Stefan Monnier
  2018-01-05 18:14             ` Eli Zaretskii
  0 siblings, 1 reply; 8+ messages in thread
From: Stefan Monnier @ 2018-01-05 16:12 UTC
  To: Eli Zaretskii; +Cc: emacs-devel

> I meant to ask why do _you_ worry about eight-bit-* characters being
> encoded more than once?

I don't really worry about it (other than as part of understanding why
the only explanation accompanying this code mentions it).

> I think the issue is that we want dired-get-filename to always return
> a multibyte string, so that its callers don't need to deal with the
> complications, like inserting unibyte strings into multibyte buffers,
> concatenating them with leading directories to form other file names,
> etc.

AFAICT a multibyte string which only consists of ASCII and eight-bit
bytes will "suffer" from the exact same problems as the corresponding
unibyte string (two such strings can be called "equal modulo
multibyteness").

Actually, most primitives will handle those two strings in the same way.
E.g. inserting either string into a buffer gives the same result (both
for unibyte and multibyte buffers), concatenating either of those
strings to a multibyte string gives the same result.
Concatenating either of those strings to a unibyte string does not give
the same result, but the two results are again "equal modulo
multibyteness".
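
The cases listed above can be sketched as follows. This is an illustration only: it covers multibyte buffers, assumes the raw byte ends up as an eight-bit character after conversion, and uses made-up strings (`uni`, `multi`, `m`).

```elisp
(let* ((uni "a\377")                       ; unibyte: ASCII plus a raw byte
       (multi (string-to-multibyte uni))   ; same content, multibyte
       (m (string #x3b1)))                 ; some multibyte string (Greek alpha)
  (list
   ;; Inserting either string into a multibyte buffer.
   (equal (with-temp-buffer (insert uni) (buffer-string))
          (with-temp-buffer (insert multi) (buffer-string)))
   ;; Concatenating either string to a multibyte string.
   (equal (concat m uni) (concat m multi))
   ;; Concatenating to a unibyte string: the representations differ,
   ;; but the results agree once both are made multibyte.
   (equal (string-to-multibyte (concat "/" uni))
          (concat "/" multi))))
;; => (t t t)
```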

So I can't imagine a scenario where calling string-to-multibyte here
will help subsequent code.

        Stefan


* Re: Why does dired go through extra efforts to avoid unibyte names
  2018-01-05 16:12           ` Stefan Monnier
@ 2018-01-05 18:14             ` Eli Zaretskii
  0 siblings, 0 replies; 8+ messages in thread
From: Eli Zaretskii @ 2018-01-05 18:14 UTC
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org
> Date: Fri, 05 Jan 2018 11:12:38 -0500
> 
> E.g. inserting either string into a buffer gives the same result (both
> for unibyte and multibyte buffers), concatenating either of those
> strings to a multibyte string gives the same result.

Depends on which of the 2 representations of raw bytes is used, I
think.  See bug#29189.



