* default charset for text/html selection in X11
From: Robert Pluim @ 2023-06-21 15:51 UTC (permalink / raw)
  To: emacs-devel

Hi,

Iʼve been playing around with the `yank-media' stuff Lars added, and
Iʼve noticed that when yanking a selection with mime-type text/html
from Chromium, what Iʼm getting is a utf-8 encoded string, which makes
this:

(defun html-mode--html-yank-handler (_type html)
  (save-restriction
    (insert html)
    (ignore-errors
      (sgml-pretty-print (point-min) (point-max)))))

insert any codepoints > 127 as their constituent raw bytes
instead, eg U+A0 ends up as \xc2\xa0 in the buffer.

I *think* it should be OK to assume utf-8 here, and thus do:

(defun html-mode--html-yank-handler (_type html)
  (save-restriction
    (insert (decode-coding-string html 'utf-8 t))
    (ignore-errors
      (sgml-pretty-print (point-min) (point-max)))))

but I canʼt find a normative reference for that (if this was http, the
default charset would be iso-8859-1, but this isnʼt http).
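A minimal sketch of the difference between the two handlers, working on
strings only.  The `my/' names are hypothetical and not part of Emacs;
the sketch assumes the selection data arrives as a unibyte string of
UTF-8 bytes, as observed with Chromium:

```elisp
;; Hypothetical helper names; only the built-in calls are real Emacs API.

(defun my/utf8-bytes-nbsp ()
  "Return a unibyte string holding the UTF-8 encoding of U+00A0."
  (unibyte-string #xc2 #xa0))

(defun my/decode-html-selection (html)
  "Decode the unibyte string HTML as UTF-8 and return the result.
This assumes, as the second handler does, that the data is UTF-8."
  (decode-coding-string html 'utf-8 t))

(let* ((raw (my/utf8-bytes-nbsp))
       (decoded (my/decode-html-selection raw)))
  (list (multibyte-string-p raw)   ; is the raw data multibyte?
        (length raw)               ; number of bytes before decoding
        (length decoded)           ; number of characters after decoding
        (aref decoded 0)))         ; code point of the decoded character
```

Evaluating the last form gives (nil 2 1 160): two raw bytes before
decoding, one character (U+00A0) after.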

Robert
-- 



* Re: default charset for text/html selection in X11
From: Eli Zaretskii @ 2023-06-21 17:13 UTC (permalink / raw)
  To: Robert Pluim; +Cc: emacs-devel

> From: Robert Pluim <rpluim@gmail.com>
> Date: Wed, 21 Jun 2023 17:51:19 +0200
> 
> (defun html-mode--html-yank-handler (_type html)
>   (save-restriction
>     (insert html)
>     (ignore-errors
>       (sgml-pretty-print (point-min) (point-max)))))
> 
> insert any codepoints > 127 as their constituent raw bytes
> instead, eg U+A0 ends up as \xc2\xa0 in the buffer.
> 
> I *think* it should be OK to assume utf-8 here, and thus do:
> 
> (defun html-mode--html-yank-handler (_type html)
>   (save-restriction
>     (insert (decode-coding-string html 'utf-8 t))
>     (ignore-errors
>       (sgml-pretty-print (point-min) (point-max)))))
> 
> but I canʼt find a normative reference for that (if this was http, the
> default charset would be iso-8859-1, but this isnʼt http).

How about looking in the sources of Chromium?

If the encoding doesn't have to be UTF-8, forcing UTF-8 there might
not be the best idea.



* Re: default charset for text/html selection in X11
From: Po Lu @ 2023-06-22  0:56 UTC (permalink / raw)
  To: Robert Pluim; +Cc: emacs-devel

Robert Pluim <rpluim@gmail.com> writes:

> Hi,
>
> Iʼve been playing around with the `yank-media' stuff Lars added, and
> Iʼve noticed that when yanking a selection with mime-type text/html
> from Chromium, what Iʼm getting is a utf-8 encoded string, which makes
> this:
>
> (defun html-mode--html-yank-handler (_type html)
>   (save-restriction
>     (insert html)
>     (ignore-errors
>       (sgml-pretty-print (point-min) (point-max)))))
>
> insert any codepoints > 127 as their constituent raw bytes
> instead, eg U+A0 ends up as \xc2\xa0 in the buffer.
>
> I *think* it should be OK to assume utf-8 here, and thus do:
>
> (defun html-mode--html-yank-handler (_type html)
>   (save-restriction
>     (insert (decode-coding-string html 'utf-8 t))
>     (ignore-errors
>       (sgml-pretty-print (point-min) (point-max)))))
>
> but I canʼt find a normative reference for that (if this was http, the
> default charset would be iso-8859-1, but this isnʼt http).
>
> Robert

What is the type of the string?  IOW, what's

  (get-text-property html 'foreign-selection)

?

This should be one of the usual X11 string formats: STRING
(iso-latin-1), COMPOUND_TEXT (compound-text-with-extensions), or
UTF8_STRING (utf-8).

If it's anything else, Emacs should try to detect the encoding
automatically, and fall back to Latin-1 if that fails.

Thanks.



* Re: default charset for text/html selection in X11
From: Po Lu @ 2023-06-22  3:37 UTC (permalink / raw)
  To: Robert Pluim; +Cc: emacs-devel

Po Lu <luangruo@yahoo.com> writes:

> What is the type of the string?  IOW, what's
>
>   (get-text-property html 'foreign-selection)

(get-text-property 0 'foreign-selection html), of course.  Sorry about
the confusion.

> ?
>
> This should be one of the usual X11 string formats: STRING
> (iso-latin-1), COMPOUND_TEXT (compound-text-with-extensions), or
> UTF8_STRING (utf-8).

Oh, and C_STRING, which is ASCII.
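
The type-to-encoding mapping listed above could be sketched as follows.
The `my/' names are hypothetical, the table only contains the four types
named in this thread, and `us-ascii' for C_STRING is an assumption based
on the "which is ASCII" remark:

```elisp
;; Hypothetical table: X11 selection type -> Emacs coding system.
(defconst my/x11-selection-codings
  '((STRING . iso-latin-1)
    (COMPOUND_TEXT . compound-text-with-extensions)
    (UTF8_STRING . utf-8)
    (C_STRING . us-ascii))
  "Coding systems implied by the usual X11 string types.")

(defun my/x11-selection-coding (type)
  "Return the coding system implied by selection TYPE, or nil if unknown."
  (alist-get type my/x11-selection-codings))

(defun my/decode-by-selection-type (data)
  "Decode DATA according to its `foreign-selection' text property.
Return DATA unchanged when the type is missing or not in the table."
  (let* ((type (get-text-property 0 'foreign-selection data))
         (coding (my/x11-selection-coding type)))
    (if coding
        (decode-coding-string data coding)
      data)))
```

Note that this trusts the label: data tagged STRING is decoded as
Latin-1 even if the owner actually sent UTF-8 bytes.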



* Re: default charset for text/html selection in X11
From: Robert Pluim @ 2023-06-22  7:23 UTC (permalink / raw)
  To: Po Lu; +Cc: emacs-devel

>>>>> On Thu, 22 Jun 2023 11:37:14 +0800, Po Lu <luangruo@yahoo.com> said:

    Po Lu> Po Lu <luangruo@yahoo.com> writes:
    >> What is the type of the string?  IOW, what's
    >> 
    >> (get-text-property html 'foreign-selection)

    Po Lu> (get-text-property 0 'foreign-selection html), of course.  Sorry about
    Po Lu> the confusion.

(get-text-property 0 'foreign-selection html) => STRING

but itʼs definitely a utf-8 string, not iso-latin-1.

    >> ?
    >> 
    >> This should be one of the usual X11 string formats: STRING
    >> (iso-latin-1), COMPOUND_TEXT (compound-text-with-extensions), or
    >> UTF8_STRING (utf-8).

    Po Lu> Oh, and C_STRING, which is ASCII.

Robert
-- 



* Re: default charset for text/html selection in X11
From: Po Lu @ 2023-06-22  7:57 UTC (permalink / raw)
  To: Robert Pluim; +Cc: emacs-devel

Robert Pluim <rpluim@gmail.com> writes:

>>>>>> On Thu, 22 Jun 2023 11:37:14 +0800, Po Lu <luangruo@yahoo.com> said:
>
>     Po Lu> Po Lu <luangruo@yahoo.com> writes:
>     >> What is the type of the string?  IOW, what's
>     >> 
>     >> (get-text-property html 'foreign-selection)
>
>     Po Lu> (get-text-property 0 'foreign-selection html), of course.  Sorry about
>     Po Lu> the confusion.
>
> (get-text-property 0 'foreign-selection html) => STRING
>
> but itʼs definitely a utf-8 string, not iso-latin-1.

Would you please report this as a bug, to the Chromium developers?
That is, if:

  (x-get-selection-internal 'CLIPBOARD 'text/html)

returns a string of the same type.

The ICCCM clearly states that:

  STRING as a type or a target specifies the ISO Latin-1 character set
  plus the control characters TAB (octal 11) and NEWLINE (octal 12).
  The spacing interpretation of TAB is context dependent.  Other ASCII
  control characters are explicitly not included in STRING at the
  present time.
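
As a rough illustration of that rule, here is a sketch that checks
whether raw selection bytes stay within what STRING allows.  The `my/'
names are hypothetical, and treating bytes 0x80-0x9F as outside Latin-1
is my reading of the quoted text, not something the ICCCM passage spells
out:

```elisp
(defun my/icccm-string-byte-p (byte)
  "Non-nil if BYTE is TAB, NEWLINE, or a Latin-1 graphic character.
Assumption: the C1 range 0x80-0x9F is treated as not allowed."
  (or (= byte 9)
      (= byte 10)
      (<= #x20 byte #x7e)
      (<= #xa0 byte #xff)))

(defun my/icccm-string-p (data)
  "Non-nil if every byte of the unibyte string DATA passes the check."
  (let ((ok t))
    (dotimes (i (length data))
      (unless (my/icccm-string-byte-p (aref data i))
        (setq ok nil)))
    ok))
```

Such a check can only flag some mislabelled data: the UTF-8 bytes of
U+00A0 (0xC2 0xA0) are also two valid Latin-1 characters, while the
UTF-8 bytes of U+2764 (0xE2 0x9D 0xA4) include 0x9D and are rejected.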



* Re: default charset for text/html selection in X11
From: Robert Pluim @ 2023-06-22  9:07 UTC (permalink / raw)
  To: Po Lu; +Cc: emacs-devel

>>>>> On Thu, 22 Jun 2023 15:57:59 +0800, Po Lu <luangruo@yahoo.com> said:

    Po Lu> Robert Pluim <rpluim@gmail.com> writes:
    >>>>>>> On Thu, 22 Jun 2023 11:37:14 +0800, Po Lu <luangruo@yahoo.com> said:
    >> 
    >> Po Lu> Po Lu <luangruo@yahoo.com> writes:
    >> >> What is the type of the string?  IOW, what's
    >> >> 
    >> >> (get-text-property html 'foreign-selection)
    >> 
    >> Po Lu> (get-text-property 0 'foreign-selection html), of course.  Sorry about
    >> Po Lu> the confusion.
    >> 
    >> (get-text-property 0 'foreign-selection html) => STRING
    >> 
    >> but itʼs definitely a utf-8 string, not iso-latin-1.

    Po Lu> Would you please report this as a bug, to the Chromium developers?
    Po Lu> That is, if:

    Po Lu>   (x-get-selection-internal 'CLIPBOARD 'text/html)

    Po Lu> returns a string of the same type.

It does.

    Po Lu> The ICCCM clearly states that:

    Po Lu>   STRING as a type or a target specifies the ISO Latin-1 character set
    Po Lu>   plus the control characters TAB (octal 11) and NEWLINE (octal 12.)
    Po Lu>   The spacing interpretation of TAB is context dependent.  Other ASCII
    Po Lu>   control characters are explicitly not included in STRING at the
    Po Lu>   present time.

Iʼm not about to contradict the ICCCM, but `gui-get-selection' does
the following

                    ;; Guess at the charset for types like text/html
                    ;; -- it can be anything, and different
                    ;; applications use different encodings.
                    ((string-match-p "\\`text/" (symbol-name data-type))
                     (decode-coding-string
                      data (car (detect-coding-string data))))
                    ;; Do nothing.

I took a closer look, and `yank-media' does the wrong thing, but
`(yank-media-types t)' and selecting "text/html" does the right
thing. The difference is that the former uses
`gui-backend-get-selection', and the latter uses `gui-get-selection',
and thus does the auto-detection.
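
The detection step quoted above can be pulled out into a standalone
sketch.  `my/decode-text-selection' is a hypothetical name; the body
just mirrors the `gui-get-selection' snippet, and which coding system is
picked for non-ASCII data depends on the current coding priorities:

```elisp
(defun my/decode-text-selection (data)
  "Decode DATA using the first coding system Emacs detects for it.
The detection result depends on the session's coding priorities, so
this is a best guess rather than a guaranteed answer."
  (decode-coding-string data (car (detect-coding-string data))))

;; Pure ASCII data comes back unchanged.
(my/decode-text-selection "<p>hello</p>")
```

A caller that wants this behaviour for clipboard data would go through
`gui-get-selection' rather than `gui-backend-get-selection', which is
the difference described above.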

Robert
-- 



* Re: default charset for text/html selection in X11
From: Eli Zaretskii @ 2023-06-22 10:08 UTC (permalink / raw)
  To: Robert Pluim; +Cc: luangruo, emacs-devel

> From: Robert Pluim <rpluim@gmail.com>
> Cc: emacs-devel@gnu.org
> Date: Thu, 22 Jun 2023 09:23:19 +0200
> 
> >>>>> On Thu, 22 Jun 2023 11:37:14 +0800, Po Lu <luangruo@yahoo.com> said:
> 
>     Po Lu> Po Lu <luangruo@yahoo.com> writes:
>     >> What is the type of the string?  IOW, what's
>     >> 
>     >> (get-text-property html 'foreign-selection)
> 
>     Po Lu> (get-text-property 0 'foreign-selection html), of course.  Sorry about
>     Po Lu> the confusion.
> 
> (get-text-property 0 'foreign-selection html) => STRING
> 
> but itʼs definitely a utf-8 string, not iso-latin-1.

Does that depend on the locale, perchance?  Could you try changing
the locale's codeset and looking at what Chromium produces then?



* Re: default charset for text/html selection in X11
From: Po Lu @ 2023-06-22 11:48 UTC (permalink / raw)
  To: Robert Pluim; +Cc: emacs-devel

Robert Pluim <rpluim@gmail.com> writes:

> It does.

Thanks.  The bug lies in Chromium, not Emacs.

>     Po Lu> The ICCCM clearly states that:
>
>     Po Lu>   STRING as a type or a target specifies the ISO Latin-1 character set
>     Po Lu>   plus the control characters TAB (octal 11) and NEWLINE (octal 12.)
>     Po Lu>   The spacing interpretation of TAB is context dependent.  Other ASCII
>     Po Lu>   control characters are explicitly not included in STRING at the
>     Po Lu>   present time.
>
> Iʼm not about to contradict the ICCCM, but `gui-get-selection' does
> the following
>
>                     ;; Guess at the charset for types like text/html
>                     ;; -- it can be anything, and different
>                     ;; applications use different encodings.
>                     ((string-match-p "\\`text/" (symbol-name data-type))
>                      (decode-coding-string
>                       data (car (detect-coding-string data))))
>                     ;; Do nothing.
>
> I took a closer look, and `yank-media' does the wrong thing, but
> `(yank-media-types t)' and selecting "text/html" does the right
> thing. The difference is that the former uses
> `gui-backend-get-selection', and the latter uses `gui-get-selection',
> and thus does the auto-detection.

If this does solve the problem, please modify yank-media in such a
manner.  It will make Emacs more robust against non-compliant selection
owners, which is always welcome.



* Re: default charset for text/html selection in X11
From: Robert Pluim @ 2023-06-22 12:14 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: luangruo, emacs-devel, Lars Ingebrigtsen

>>>>> On Thu, 22 Jun 2023 13:08:56 +0300, Eli Zaretskii <eliz@gnu.org> said:
    >> (get-text-property 0 'foreign-selection html) => STRING
    >> 
    >> but itʼs definitely a utf-8 string, not iso-latin-1.

    Eli> Does that depend on the locale, per chance?  Could you try changing
    Eli> the locale's codeset and looking at what Chromium produces then?

LC_ALL=C chromium

makes no difference, although

LC_ALL=C firefox

results in U+00A0 being transferred as U+0020, but something like

U+2764 U+FE0F still gives me utf-8 constituent bytes:

\xe2\x9d\xa4\xef\xb8\x8f

so Iʼm thinking that using `gui-get-selection' is the right thing. Lars,
can you shed any light on why yank-media uses
`gui-backend-get-selection' rather than `gui-get-selection'?

Robert
-- 



* Re: default charset for text/html selection in X11
From: Yuri Khan @ 2023-06-22 12:26 UTC (permalink / raw)
  To: Robert Pluim; +Cc: Eli Zaretskii, luangruo, emacs-devel, Lars Ingebrigtsen

On Thu, 22 Jun 2023 at 19:15, Robert Pluim <rpluim@gmail.com> wrote:

> LC_ALL=C chromium
>
> makes no difference, although
>
> LC_ALL=C firefox
>
> results in U+00A0 being transferred as U+0020

<tangent>

Firefox has an issue (arguably bug) where it turns all non-breaking
spaces into regular spaces when copying to clipboard or, I guess, X
selections.

https://bugzilla.mozilla.org/show_bug.cgi?id=1769534

</tangent>



* Re: default charset for text/html selection in X11
From: Robert Pluim @ 2023-06-22 12:27 UTC (permalink / raw)
  To: Po Lu; +Cc: emacs-devel

>>>>> On Thu, 22 Jun 2023 19:48:46 +0800, Po Lu <luangruo@yahoo.com> said:

    Po Lu> Robert Pluim <rpluim@gmail.com> writes:
    >> It does.

    Po Lu> Thanks.  The bug lies in Chromium, not Emacs.

And Iʼm not the first to notice:

https://bugs.chromium.org/p/chromium/issues/detail?id=760613&q=component%3ABlink%3EDataTransfer%20utf8&can=2

(although their proposed solution looks wrong)

    >> I took a closer look, and `yank-media' does the wrong thing, but
    >> `(yank-media-types t)' and selecting "text/html" does the right
    >> thing. The difference is that the former uses
    >> `gui-backend-get-selection', and the latter uses `gui-get-selection',
    >> and thus does the auto-detection.

    Po Lu> If this does solve the problem, please modify yank-media in such a
    Po Lu> manner.  It will make Emacs more robust against non-compliant selection
    Po Lu> owners, which is always welcome.

Iʼll do that, but not straight away :-)

Robert
-- 


