* String handling in xwidget primitives
From: Eli Zaretskii @ 2016-01-29 19:16 UTC
  To: joakim; +Cc: emacs-devel

The primitives xwidget-webkit-goto-uri and
xwidget-webkit-execute-script accept Lisp strings as arguments and
pass their data unaltered to the underlying GTK functions.  I think we
need to encode these strings first, but I cannot figure out which
encoding should be used.  Is it UTF-8 or something locale-dependent?
xwidget-webkit-goto-uri accepts file names (AFAIU), so perhaps it
should encode the argument as we do with file names?

Also, random documents on the Internet claim JS scripts should have a
BOM if they are in UTF-8, is that correct?

xwidget-webkit-get-title uses build_string to create a Lisp string
which it returns, but build_string is not really appropriate for
non-ASCII strings.  In what encoding does webkit_web_view_get_title
return its value?  Is that UTF-8, or could that be something else?  (I
cannot find any documentation of that.)

In any case, what we have now is incorrect, and can only work by luck.
We ought to fix that before the release.




* Re: String handling in xwidget primitives
From: Eli Zaretskii @ 2016-01-29 20:25 UTC
  To: joakim; +Cc: emacs-devel

> From: joakim@verona.se
> Date: Fri, 29 Jan 2016 20:25:21 +0100
> 
> I briefly tested this:
> 
> (xwidget-webkit-execute-script (xwidget-at 0) "alert('𝌆')")
> 
> where 𝌆 is a Unicode character I took from
> 
> https://mathiasbynens.be/notes/javascript-encoding
> This page seems to indicate that UTF-16 is used.

I've seen such claims.  But they cannot be true, since if they were,
we couldn't have passed pure ASCII strings to those interfaces without
triggering weird errors: each ASCII character takes 2 bytes in UTF-16,
not one.
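
A minimal sketch of that incompatibility, seen from the other
direction: if a script really were handed over as UTF-16 through a
NUL-terminated `char *' interface, the consumer would stop at the first
zero byte.  The function names below are hypothetical and not part of
Emacs or WebKitGTK; the encoder handles ASCII input only.

```c
#include <stddef.h>
#include <string.h>

/* Encode an ASCII-only C string as UTF-16LE into OUT, which has room
   for OUT_CAP bytes.  Return the number of payload bytes written
   (excluding the two-byte terminator), or 0 if OUT is too small.  */
size_t
ascii_to_utf16le (const char *ascii, unsigned char *out, size_t out_cap)
{
  size_t n = strlen (ascii);
  if (out_cap < 2 * n + 2)
    return 0;
  for (size_t i = 0; i < n; i++)
    {
      out[2 * i] = (unsigned char) ascii[i];  /* low byte */
      out[2 * i + 1] = 0;                     /* high byte is 0 for ASCII */
    }
  out[2 * n] = 0;                             /* UTF-16 terminator */
  out[2 * n + 1] = 0;
  return 2 * n;
}

/* Length seen by a consumer that treats BYTES as a NUL-terminated
   `char *' string.  */
size_t
c_string_length (const unsigned char *bytes)
{
  return strlen ((const char *) bytes);
}
```

Encoding "alert(1)" this way yields 16 bytes, of which a `char *'
consumer sees only the first one.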

I think UTF-16 is used internally to represent strings, but the script
itself should not be in UTF-16.  I think it should be either in UTF-8
(and then requires a BOM), or it should include the charset= metadata
to indicate its encoding.

> I executed the code in a buffer containing a webkit instance, and the
> char showed up in an alert box originating from the webkit instance.
> 
> This doesn't actually prove anything, but it does seem to show that in my
> case on my machine and environment, at least something goes right.

Sheer luck: you just didn't bump into all those subtleties which make
the internal representation of strings in Emacs be a superset of
UTF-8, but not exactly UTF-8.
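
To illustrate the difference, here is a rough sketch of a strict UTF-8
check in the sense of RFC 3629.  It is not code from Emacs, and
is_strict_utf8 is a hypothetical name.  It relies on one assumption
about the internal format: that a raw 8-bit byte is stored as a
two-byte sequence with lead byte 0xC0 or 0xC1, which strict UTF-8
treats as an overlong form and rejects.

```c
#include <stdbool.h>
#include <stddef.h>

/* Return true if the LEN bytes at S are well-formed UTF-8 per RFC 3629:
   no overlong forms, no surrogates, nothing above U+10FFFF.  */
bool
is_strict_utf8 (const unsigned char *s, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      unsigned char b = s[i];
      size_t need;              /* number of continuation bytes */
      unsigned long cp, min;    /* code point and its smallest legal value */

      if (b < 0x80)
        {
          i++;                  /* plain ASCII */
          continue;
        }
      else if ((b & 0xE0) == 0xC0)
        { need = 1; cp = b & 0x1F; min = 0x80; }
      else if ((b & 0xF0) == 0xE0)
        { need = 2; cp = b & 0x0F; min = 0x800; }
      else if ((b & 0xF8) == 0xF0)
        { need = 3; cp = b & 0x07; min = 0x10000; }
      else
        return false;           /* stray continuation or invalid lead byte */

      if (len - i - 1 < need)
        return false;           /* sequence is truncated */
      for (size_t k = 1; k <= need; k++)
        {
          if ((s[i + k] & 0xC0) != 0x80)
            return false;       /* not a continuation byte */
          cp = (cp << 6) | (s[i + k] & 0x3F);
        }
      if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;           /* overlong, out of range, or surrogate */
      i += need + 1;
    }
  return true;
}
```

The four bytes F0 9D 8C 86 (U+1D306, the character in the test above)
pass this check, while the pair C0 80 does not.  A string made only of
the former kind goes through unencoded without visible harm.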

> If we do need to encode, do you know some part of the Emacs source
> where I can see which functions to use?

It depends on how we need to encode.  In general,
code_convert_string_norecord is the most frequently used function in
these cases.




* Re: String handling in xwidget primitives
From: Paul Eggert @ 2016-01-29 22:57 UTC
  To: Eli Zaretskii, joakim; +Cc: emacs-devel

On 01/29/2016 11:16 AM, Eli Zaretskii wrote:
> The primitives xwidget-webkit-goto-uri and
> xwidget-webkit-execute-script accept Lisp strings as arguments and
> pass their data unaltered to the underlying GTK functions.  I think we
> need to encode these strings first, but I cannot figure out which
> encoding should be used.  Is it UTF-8 or something locale-dependent?

As I understand it the default is UTF-8, but you can override this by 
using a custom encoding. I'd guess we should just use the default.

Dumb question: shouldn't URIs be encoded in punycode? See the thread 
starting here:

https://lists.gnu.org/archive/html/emacs-devel/2015-12/msg01373.html

> Also, random documents on the Internet claim JS scripts should have a
> BOM if they are in UTF-8, is that correct?
>

I'm skeptical. No doubt there are issues in this area, but I can also 
find random documents saying that JS scripts *with* BOMs make programs 
croak, e.g.:

http://compgroups.net/comp.lang.php/javascript-php-byte-order-mark-problem/1384837

Plus, I see some evidence that at least one JavaScript linter will warn 
you about BOMs:

https://github.com/jshint/jshint/pull/2285
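
For reference, a UTF-8 BOM is the three bytes EF BB BF at the start of
the data.  If a caller wanted to make sure none is passed along, the
check could look roughly like the sketch below.  The helper name
utf8_bom_length is hypothetical and not an existing Emacs or WebKitGTK
function.

```c
#include <stddef.h>
#include <string.h>

/* Return the number of leading bytes of S (which has LEN bytes) that
   are occupied by a UTF-8 byte order mark: 3 if present, else 0.  */
size_t
utf8_bom_length (const char *s, size_t len)
{
  static const char bom[] = "\xEF\xBB\xBF";
  return (len >= 3 && memcmp (s, bom, 3) == 0) ? 3 : 0;
}
```

A caller would then pass `script + utf8_bom_length (script, len)' on to
the consumer instead of `script'.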




* Re: String handling in xwidget primitives
From: Eli Zaretskii @ 2016-01-30  7:57 UTC
  To: Paul Eggert; +Cc: joakim, emacs-devel

> Cc: emacs-devel@gnu.org
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Fri, 29 Jan 2016 14:57:20 -0800
> 
> On 01/29/2016 11:16 AM, Eli Zaretskii wrote:
> > The primitives xwidget-webkit-goto-uri and
> > xwidget-webkit-execute-script accept Lisp strings as arguments and
> > pass their data unaltered to the underlying GTK functions.  I think we
> > need to encode these strings first, but I cannot figure out which
> > encoding should be used.  Is it UTF-8 or something locale-dependent?
> 
> As I understand it the default is UTF-8, but you can override this by 
> using a custom encoding. I'd guess we should just use the default.

Sure, if UTF-8 is accepted by default, it's the best and easiest
alternative.

> Dumb question: shouldn't URIs be encoded in punycode?

Good question.  I don't know.  The URI gets passed to the
webkit_web_view_load_uri API from WebKitGTK, whose documentation says
nothing about this (or the encoding in general).  Maybe someone could
look in the sources and figure out what's TRT, or find the information
somewhere.  My personal impression from googling about this is that at
least JS seems to not expect URIs in punycode.  But I may be mistaken.

> > Also, random documents on the Internet claim JS scripts should have a
> > BOM if they are in UTF-8, is that correct?
> >
> 
> I'm skeptical. No doubt there are issues in this area, but I can also 
> find random documents saying that JS scripts *with* BOMs make programs 
> croak, e.g.:
> 
> http://compgroups.net/comp.lang.php/javascript-php-byte-order-mark-problem/1384837
> 
> Plus, I see some evidence that at least one JavaScript linter will warn 
> you about BOMs:
> 
> https://github.com/jshint/jshint/pull/2285

Thanks, I guess that answers the question.


