unofficial mirror of emacs-devel@gnu.org 
* Re: master 6011d39b6a: Fix drag-and-drop of files with multibyte filenames
@ 2022-06-05  9:21 Eli Zaretskii
  2022-06-05 10:00 ` Po Lu
  0 siblings, 1 reply; 6+ messages in thread
From: Eli Zaretskii @ 2022-06-05  9:21 UTC
  To: Po Lu; +Cc: emacs-devel

> --- a/lisp/select.el
> +++ b/lisp/select.el
> @@ -630,20 +630,20 @@ two markers or an overlay.  Otherwise, it is nil."
>          (xselect--encode-string 'TEXT (buffer-file-name (nth 2 value))))
>      (if (and (stringp value)
>               (file-exists-p value))
> -        (xselect--encode-string 'TEXT (expand-file-name value)
> -                                nil t)
> +        ;; Motif expects this to be STRING, but it treats the data as
> +        ;; a sequence of bytes instead of a Latin-1 string.
> +        (cons 'STRING (encode-coding-string (expand-file-name value)
> +                                            'raw-text-unix))

I don't think I understand this change.  raw-text basically doesn't do
any conversion, except if the text includes raw bytes.  Is that the
problem here, and if so, how come a file name can include raw bytes in
its name?  And what does "Motif expects this to be STRING, but it
treats the data as a sequence of bytes instead of a Latin-1 string"
mean in this context?  The difference between raw bytes and Latin-1
strings is only meaningful to Emacs; how does Motif distinguish
between them?




* Re: master 6011d39b6a: Fix drag-and-drop of files with multibyte filenames
  2022-06-05  9:21 master 6011d39b6a: Fix drag-and-drop of files with multibyte filenames Eli Zaretskii
@ 2022-06-05 10:00 ` Po Lu
  2022-06-05 10:31   ` Eli Zaretskii
  0 siblings, 1 reply; 6+ messages in thread
From: Po Lu @ 2022-06-05 10:00 UTC
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> I don't think I understand this change.  raw-text basically doesn't do
> any conversion, except if the text includes raw bytes.  Is that the
> problem here, and if so, how come a file name can include raw bytes in
> its name?

Encoding it as `raw-text-unix' is to satisfy the requirement in
xselect.c that strings returned by selection converters must be
unibyte.  IOW, it's the same as

  (string-as-unibyte (expand-file-name value))

except that we can't use `string-as-unibyte', because it's obsolete.
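
A minimal sketch of the unibyte requirement mentioned above, assuming a
made-up file name (`my-demo-name' and `my-demo-bytes' are hypothetical
and not part of select.el).  It only shows that `encode-coding-string'
with `raw-text-unix' turns a multibyte string into a unibyte one:

```elisp
;; Hypothetical file name containing a non-ASCII character (U+00E9).
(defvar my-demo-name "/tmp/\u00e9.txt")

;; Encoding with `raw-text-unix' yields a unibyte string, which is
;; what selection converters are expected to return according to the
;; explanation above.
(defvar my-demo-bytes
  (encode-coding-string my-demo-name 'raw-text-unix))

(multibyte-string-p my-demo-name)   ; non-nil: the input is multibyte
(multibyte-string-p my-demo-bytes)  ; nil: the result is unibyte
```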

> And what does "Motif expects this to be STRING, but it treats the data
> as a sequence of bytes instead of a Latin-1 string" mean in this
> context?  The difference between raw bytes and Latin-1 strings is only
> meaningful to Emacs; how does Motif distinguish between them?

The selection property type STRING means a Latin-1 string, with some
minor extensions.  See this paragraph under "TEXT Properties" in the
ICCCM:

   STRING as a type or a target specifies the ISO Latin-1 character set
   plus the control characters TAB (octal 11) and NEWLINE (octal
   12). The spacing interpretation of TAB is context dependent. Other
   ASCII control characters are explicitly not included in STRING at the
   present time.

But Motif doesn't comply with the ICCCM meaning of STRING or use the
generic TEXT type when converting a drag-and-drop selection to
FILE_NAME.  It instead expects the type of the selection property to be
STRING, but the data is treated as raw bytes.




* Re: master 6011d39b6a: Fix drag-and-drop of files with multibyte filenames
  2022-06-05 10:00 ` Po Lu
@ 2022-06-05 10:31   ` Eli Zaretskii
  2022-06-05 11:42     ` Po Lu
  0 siblings, 1 reply; 6+ messages in thread
From: Eli Zaretskii @ 2022-06-05 10:31 UTC
  To: Po Lu; +Cc: emacs-devel

> From: Po Lu <luangruo@yahoo.com>
> Cc: emacs-devel@gnu.org
> Date: Sun, 05 Jun 2022 18:00:10 +0800
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > I don't think I understand this change.  raw-text basically doesn't do
> > any conversion, except if the text includes raw bytes.  Is that the
> > problem here, and if so, how come a file name can include raw bytes in
> > its name?
> 
> Encoding it as `raw-text-unix' is to satisfy the requirement in
> xselect.c that strings returned by selection converters must be
> unibyte.  IOW, it's the same as
> 
>   (string-as-unibyte (expand-file-name value))
> 
> except that we can't use `string-as-unibyte', because it's obsolete.

Then why not encode in UTF-8, for example?

> > And what does "Motif expects this to be STRING, but it treats the data
> > as a sequence of bytes instead of a Latin-1 string" mean in this
> > context?  The difference between raw bytes and Latin-1 strings is only
> > meaningful to Emacs; how does Motif distinguish between them?
> 
> The selection property type STRING means a Latin-1 string, with some
> minor extensions.  See this paragraph under "TEXT Properties" in the
> ICCCM:
> 
>    STRING as a type or a target specifies the ISO Latin-1 character set
>    plus the control characters TAB (octal 11) and NEWLINE (octal
>    12). The spacing interpretation of TAB is context dependent. Other
>    ASCII control characters are explicitly not included in STRING at the
>    present time.
> 
> But Motif doesn't comply with the ICCCM meaning of STRING or use the
> generic TEXT type when converting a drag-and-drop selection to
> FILE_NAME.  It instead expects the type of the selection property to be
> STRING, but the data is treated as raw bytes.

If some program other than Emacs is the target of the drop, raw bytes
produced from raw-text will not be meaningful for it.

I actually don't understand why you don't use ENCODE_FILE for files
and ENCODE_SYSTEM for everything else -- this is the only encoding
which we know to be generally suitable for any operation that calls
low-level C APIs whose implementation is not in Emacs.  Bonus points
for adhering to selection-coding-system when that is non-nil.

Are there any known problems with using these two system encodings in
this case?




* Re: master 6011d39b6a: Fix drag-and-drop of files with multibyte filenames
  2022-06-05 10:31   ` Eli Zaretskii
@ 2022-06-05 11:42     ` Po Lu
  2022-06-05 12:54       ` Eli Zaretskii
  0 siblings, 1 reply; 6+ messages in thread
From: Po Lu @ 2022-06-05 11:42 UTC
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> Then why not encode in UTF-8, for example?

How about (or file-name-coding-system default-file-name-coding-system)
instead?  AFAICT, that's what ENCODE_FILE does.

> If some program other than Emacs is the target of the drop, raw bytes
> produced from raw-text will not be meaningful for it.

Why not?  Aren't those bytes equivalent to a C string describing a file
name that can be passed to `open'?

I wrote that code according to how C_STRINGs are already encoded in
select.el:

           ((eq type 'C_STRING)
            ;; According to ICCCM Protocol v2.0 (para 2.7.1), C_STRING
            ;; is a zero-terminated sequence of raw bytes that
            ;; shouldn't be interpreted as text in any encoding.
            ;; Therefore, if STR is unibyte (the normal case), we use
            ;; it as-is; otherwise we assume some of the characters
            ;; are eight-bit and ensure they are converted to their
            ;; single-byte representation.
            (or (null (multibyte-string-p str))
                (setq str (encode-coding-string str 'raw-text-unix))))

> I actually don't understand why you don't use ENCODE_FILE for files
> and ENCODE_SYSTEM for everything else -- this is the only encoding
> which we know to be generally suitable for any operation that calls
> low-level C APIs whose implementation is not in Emacs.  Bonus points
> for adhering to selection-coding-system when that is non-nil.
>
> Are there any known problems with using these two system encodings in
> this case?

Yes: the entire selection mechanism is implemented in Lisp, and moving
parts to C specifically would require some rethinking of the C code
involved, and wouldn't be backwards-compatible.

The FILE_NAME target has existed for decades in Lisp for programs that
comply with the ICCCM and also deals with all kinds of file name
encodings (see the call to `xselect--encode-string' in
`xselect-convert-to-filename'), so I don't see why this code cannot.




* Re: master 6011d39b6a: Fix drag-and-drop of files with multibyte filenames
  2022-06-05 11:42     ` Po Lu
@ 2022-06-05 12:54       ` Eli Zaretskii
  2022-06-05 13:07         ` Po Lu
  0 siblings, 1 reply; 6+ messages in thread
From: Eli Zaretskii @ 2022-06-05 12:54 UTC
  To: Po Lu; +Cc: emacs-devel

> From: Po Lu <luangruo@yahoo.com>
> Cc: emacs-devel@gnu.org
> Date: Sun, 05 Jun 2022 19:42:49 +0800
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > Then why not encode in UTF-8, for example?
> 
> How about (or file-name-coding-system default-file-name-coding-system)
> instead?  AFAICT, that's what ENCODE_FILE does.

Yes.  Sorry, I forgot that the code was in Lisp, not C.

> > If some program other than Emacs is the target of the drop, raw bytes
> > produced from raw-text will not be meaningful for it.
> 
> Why not?  Aren't those bytes equivalent to a C string describing a file
> name that can be passed to `open'?

Not necessarily.  First, non-ASCII characters can be encoded in
different ways, and the other program might not necessarily support
more than just the locale's encoding.  And second, any characters to
which Emacs gives codepoints beyond the Unicode codespace (something
that is rare, but it does happen) will not be understood by the other
programs at all, because their codepoints are completely private to
Emacs.
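
To illustrate the first point, here is a small sketch (the sample
character is an assumption chosen for illustration) showing that the
same character yields different byte sequences depending on the coding
system, so the receiving program has to know which one was used:

```elisp
;; U+00E9 encoded two ways; both results are unibyte strings.
(let ((name "\u00e9"))
  (list (encode-coding-string name 'utf-8)         ; => "\303\251"
        (encode-coding-string name 'iso-latin-1))) ; => "\351"
```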

> I wrote that code according to how C_STRINGs are already encoded in
> select.el:
> 
>            ((eq type 'C_STRING)
>             ;; According to ICCCM Protocol v2.0 (para 2.7.1), C_STRING
>             ;; is a zero-terminated sequence of raw bytes that
>             ;; shouldn't be interpreted as text in any encoding.
>             ;; Therefore, if STR is unibyte (the normal case), we use
>             ;; it as-is; otherwise we assume some of the characters
>             ;; are eight-bit and ensure they are converted to their
>             ;; single-byte representation.
>             (or (null (multibyte-string-p str))
>                 (setq str (encode-coding-string str 'raw-text-unix))))

See the comment: it explicitly talks about "strings" that aren't text.
File names are always human-readable text, or at least they should be.

> > I actually don't understand why you don't use ENCODE_FILE for files
> > and ENCODE_SYSTEM for everything else -- this is the only encoding
> > which we know to be generally suitable for any operation that calls
> > low-level C APIs whose implementation is not in Emacs.  Bonus points
> > for adhering to selection-coding-system when that is non-nil.
> >
> > Are there any known problems with using these two system encodings in
> > this case?
> 
> Yes: the entire selection mechanism is implemented in Lisp, and moving
> parts to C specifically would require some rethinking of the C code
> involved, and wouldn't be backwards-compatible.

No need to move anything to C: you can do the same in Lisp.  See
above.
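
A rough sketch of what doing this in Lisp could look like, assuming the
coding-system expression quoted earlier in this thread.  The function
name `my-encode-file-name-for-selection' is hypothetical, and the
fallback to `raw-text-unix' when both variables are nil is an
assumption made for illustration, not something taken from select.el:

```elisp
(defun my-encode-file-name-for-selection (file)
  "Return (STRING . BYTES) for FILE, encoded as a file name.
Hypothetical helper, for illustration only."
  (let ((coding (or file-name-coding-system
                    default-file-name-coding-system)))
    (cons 'STRING
          (encode-coding-string (expand-file-name file)
                                ;; Assumption: fall back to raw bytes
                                ;; when no file-name coding system is set.
                                (or coding 'raw-text-unix)))))
```

With `file-name-coding-system' bound to `utf-8', the cdr of the result
is the UTF-8 encoding of the expanded file name.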

> The FILE_NAME target has existed for decades in Lisp for programs that
> comply with the ICCCM and also deals with all kinds of file name
> encodings (see the call to `xselect--encode-string' in
> `xselect-convert-to-filename'), so I don't see why this code cannot.

<Shrug> I guess that other code is also incorrect, and was never
seriously tested with non-ASCII file names outside of UTF-8 locales.
Try Emacs whose file-name-coding-system is iso-2022-jp or somesuch.
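
One way to try this, sketched with a made-up Japanese file name (the
name and the variable names below are assumptions for illustration):
bind `file-name-coding-system' to `iso-2022-jp' and compare the result
with what `raw-text-unix' produces.

```elisp
(let* ((file-name-coding-system 'iso-2022-jp)
       (name "\u65e5\u672c.txt")        ; hypothetical file name
       (as-file-name (encode-coding-string name file-name-coding-system))
       (as-raw-text (encode-coding-string name 'raw-text-unix)))
  ;; The two byte sequences differ, so a drop target that expects the
  ;; file-name encoding may not be able to open the raw-text bytes.
  (equal as-file-name as-raw-text))     ; => nil
```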




* Re: master 6011d39b6a: Fix drag-and-drop of files with multibyte filenames
  2022-06-05 12:54       ` Eli Zaretskii
@ 2022-06-05 13:07         ` Po Lu
  0 siblings, 0 replies; 6+ messages in thread
From: Po Lu @ 2022-06-05 13:07 UTC
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> Yes.  Sorry, I forgot that the code was in Lisp, not C.
>
>> > If some program other than Emacs is the target of the drop, raw bytes
>> > produced from raw-text will not be meaningful for it.
>> 
>> Why not?  Aren't those bytes equivalent to a C string describing a file
>> name that can be passed to `open'?
>
> Not necessarily.  First, non-ASCII characters can be encoded in
> different ways, and the other program might not necessarily support
> more than just the locale's encoding.  And second, any characters to
> which Emacs gives codepoints beyond the Unicode codespace (something
> that is rare, but it does happen) will not be understood by the other
> programs at all, because their codepoints are completely private to
> Emacs.

[...]

> <Shrug> I guess that other code is also incorrect, and was never
> seriously tested with non-ASCII file names outside of UTF-8 locales.
> Try Emacs whose file-name-coding-system is iso-2022-jp or somesuch.

That makes sense now, yes.  Thanks.


