bug#31679: 26.1; detect-coding-string does not detect UTF-16

unofficial mirror of bug-gnu-emacs@gnu.org 

* bug#31679: 26.1; detect-coding-string does not detect UTF-16
@ 2018-06-01 19:40 Benjamin Riefenstahl
  2018-06-02  7:42 ` Eli Zaretskii
  0 siblings, 1 reply; 6+ messages in thread
From: Benjamin Riefenstahl @ 2018-06-01 19:40 UTC (permalink / raw)
  To: 31679

I have been trying this (in real life the strings are often longer, of
course):

  (detect-coding-string "h\0t\0m\0l\0")

And I was surprised that this does not detect UTF-16 but instead gives
(no-conversion).

The result of (coding-system-priority-list) is

   (utf-8 iso-2022-7bit iso-latin-1 iso-2022-7bit-lock iso-2022-8bit-ss2
    emacs-mule raw-text iso-2022-jp in-is13194-devanagari
    chinese-iso-8bit utf-8-auto utf-8-with-signature utf-16
    utf-16be-with-signature utf-16le-with-signature utf-16be utf-16le
    japanese-shift-jis chinese-big5 undecided)

Does this just not work, or am I doing something wrong?

Thanks,
benny


Recent messages:
For information about GNU Emacs and the GNU system, type C-h C-a.
next-line: End of buffer
(no-conversion)
Quit [2 times]
Type C-x 1 to delete the help window.
Mark set
delete-backward-char: Text is read-only

Configured features:
XPM JPEG TIFF GIF PNG RSVG IMAGEMAGICK SOUND GPM GSETTINGS NOTIFY
LIBSELINUX GNUTLS LIBXML2 FREETYPE M17N_FLT LIBOTF XFT ZLIB
TOOLKIT_SCROLL_BARS GTK2 X11 THREADS LCMS2

Important settings:
  value of $LANG: en_US.UTF-8
  locale-coding-system: utf-8-unix

Major mode: Lisp Interaction

Minor modes in effect:
  tooltip-mode: t
  global-eldoc-mode: t
  eldoc-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t
  transient-mark-mode: t

Load-path shadows:
None found.

Features:
(shadow sort mail-extr emacsbug message rmc puny seq byte-opt gv
bytecomp byte-compile cconv dired dired-loaddefs format-spec rfc822 mml
mml-sec password-cache epa derived epg epg-config gnus-util rmail
rmail-loaddefs mm-decode mm-bodies mm-encode mail-parse rfc2231
mailabbrev gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums
mm-util mail-prsvr mail-utils cl-extra help-fns radix-tree help-mode
easymenu cl-loaddefs cl-lib term/xterm xterm time-date elec-pair
mule-util tooltip eldoc electric uniquify ediff-hook vc-hooks
lisp-float-type mwheel term/x-win x-win term/common-win x-dnd tool-bar
dnd fontset image regexp-opt fringe tabulated-list replace newcomment
text-mode elisp-mode lisp-mode prog-mode register page menu-bar
rfn-eshadow isearch timer select scroll-bar mouse jit-lock font-lock
syntax facemenu font-core term/tty-colors frame cl-generic cham georgian
utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao korean
japanese eucjp-ms cp51932 hebrew greek romanian slovak czech european
ethiopic indian cyrillic chinese composite charscript charprop
case-table epa-hook jka-cmpr-hook help simple abbrev obarray minibuffer
cl-preloaded nadvice loaddefs button faces cus-face macroexp files
text-properties overlay sha1 md5 base64 format env code-pages mule
custom widget hashtable-print-readable backquote inotify lcms2
dynamic-setting system-font-setting font-render-setting move-toolbar gtk
x-toolkit x multi-tty make-network-process emacs)

Memory information:
((conses 8 102532 5281)
 (symbols 24 20919 1)
 (miscs 20 38 212)
 (strings 16 29808 1314)
 (string-bytes 1 767826)
 (vectors 12 12354)
 (vector-slots 4 470678 7618)
 (floats 8 56 559)
 (intervals 28 260 1)
 (buffers 536 12)
 (heap 1024 30861 580))






* bug#31679: 26.1; detect-coding-string does not detect UTF-16
  2018-06-01 19:40 bug#31679: 26.1; detect-coding-string does not detect UTF-16 Benjamin Riefenstahl
@ 2018-06-02  7:42 ` Eli Zaretskii
  2018-06-02 13:55   ` Benjamin Riefenstahl
  0 siblings, 1 reply; 6+ messages in thread
From: Eli Zaretskii @ 2018-06-02  7:42 UTC (permalink / raw)
  To: Benjamin Riefenstahl; +Cc: 31679

> From: Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net>
> Date: Fri, 01 Jun 2018 21:40:32 +0200
> 
> I have been trying this (in real life the strings are often longer, of
> course):
> 
>   (detect-coding-string "h\0t\0m\0l\0")
> 
> And I was surprised that this does not detect UTF-16 but instead gives
> (no-conversion).

First, you should lose the trailing null (or add one more), since
UTF-16 strings must, by definition, have an even number of bytes.

Next, you should disable null byte detection by binding
inhibit-null-byte-detection to a non-nil value, because otherwise
Emacs's guesswork will prefer no-conversion, assuming this is binary
data.

If you do that, you get

  (let ((inhibit-null-byte-detection t))
    (detect-coding-string "h\0t\0m\0l"))
  => (undecided)

Why? because it is perfectly valid for a plain-ASCII string to include
null bytes, so Emacs prefers to guess ASCII.

As another example, try this:

  (prefer-coding-system 'utf-16)
  (let ((inhibit-null-byte-detection t))
    (detect-coding-string (encode-coding-string "áçðë" 'utf-16-be) t))
  => utf-16

but

  (let ((inhibit-null-byte-detection t))
    (detect-coding-string
      (substring (encode-coding-string "áçðë" 'utf-16-be) 2) t))
  => iso-latin-1

So even when UTF-16 is the most preferred encoding, just removing the
BOM is enough to let Emacs prefer something other than UTF-16.

Moral: detecting an encoding in Emacs is based on heuristic
_guesswork_, which is heavily biased to what is deemed to be the most
frequent use cases.  And UTF-16 is quite infrequent, at least on Posix
hosts.

IOW, detecting encoding in Emacs is not as reliable as you seem to
expect.  If you _know_ the text is in UTF-16, just tell Emacs to use
that, don't let it guess.
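
A minimal sketch of the explicit approach, assuming the bytes are known
to be little-endian UTF-16 without a BOM (the `utf-16le' coding system).
The helper name `my-decode-utf-16le' is hypothetical and exists only for
this illustration:

```elisp
;; Hypothetical helper, not part of Emacs: decode BYTES as UTF-16LE
;; directly instead of asking Emacs to guess the encoding.
(defun my-decode-utf-16le (bytes)
  "Decode the unibyte string BYTES as little-endian UTF-16 without a BOM."
  (decode-coding-string bytes 'utf-16le))

;; The string from the original report, decoded explicitly.
(my-decode-utf-16le "h\0t\0m\0l\0")
;; => "html"
```

Naming the coding system sidesteps both the null-byte check and the
priority list, since no detection takes place.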


* bug#31679: 26.1; detect-coding-string does not detect UTF-16
  2018-06-02  7:42 ` Eli Zaretskii
@ 2018-06-02 13:55   ` Benjamin Riefenstahl
  2018-06-02 14:24     ` Eli Zaretskii
  0 siblings, 1 reply; 6+ messages in thread
From: Benjamin Riefenstahl @ 2018-06-02 13:55 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 31679

Hi Eli,

>> From: Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net>
>>   (detect-coding-string "h\0t\0m\0l\0")
>> 
>> And I was surprised that this does not detect UTF-16 but instead gives
>> (no-conversion).

Eli Zaretskii writes:
> First, you should lose the trailing null (or add one more), since
> UTF-16 strings must, by definition, have an even number of bytes.

Actually this string *has* 8 bytes, the last '\0' completes the 'l' to
form the two-byte character.
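
As a quick check of the byte count, a small sketch using only the
built-in functions `length' and `string-bytes':

```elisp
;; The string from the report: four ASCII characters, each followed by
;; a NUL byte, so it is a unibyte string of eight bytes.
(length "h\0t\0m\0l\0")        ; => 8
(string-bytes "h\0t\0m\0l\0")  ; => 8

;; Dropping the trailing NUL leaves an odd number of bytes, which
;; cannot be split into whole 16-bit code units.
(string-bytes "h\0t\0m\0l")    ; => 7
```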

> Next, you should disable null byte detection by binding
> inhibit-null-byte-detection to a non-nil value, because otherwise
> Emacs's guesswork will prefer no-conversion, assuming this is binary
> data.

OK, that is a good tip.

> Why? because it is perfectly valid for a plain-ASCII string to include
> null bytes, so Emacs prefers to guess ASCII.

While NUL is a valid ASCII character according to the standard,
practically nobody uses it as a character.  So for a heuristic in this
context, it would be a bad decision to treat it just as another
character.

And indeed NUL bytes are treated as a strong indication of binary data,
it seems.  I tried to debug this.  The C routine detect_coding_utf_16
tries to distinguish between binary and UTF-16, but it is not called for
the string above.  That routine is called OTOH, when I add a non-ASCII
character as in "h\0t\0m\0l\0ü\0", but even then it decides that the
string is not UTF-16 (?).

> Moral: detecting an encoding in Emacs is based on heuristic
> _guesswork_, which is heavily biased to what is deemed to be the most
> frequent use cases.  And UTF-16 is quite infrequent, at least on Posix
> hosts.
>
> IOW, detecting encoding in Emacs is not as reliable as you seem to
> expect.  If you _know_ the text is in UTF-16, just tell Emacs to use
> that, don't let it guess.

My use-case is that I am trying to paste types other than UTF8_STRING
from the X11 clipboard, and have them handled as automatically as
possible.  While official clipboard types probably have a documented
encoding (and I have code for those), applications like Firefox also put
private formats there.  And Firefox seems to like UTF-16, even the
text/html format it puts there is UTF-16.

I have tried to debug the C routines that implement this (s.a.), but the
code is somewhat hairy.  I guess I'll have another look to see if I can
understand it better.

Thanks so far,
benny


* bug#31679: 26.1; detect-coding-string does not detect UTF-16
  2018-06-02 13:55   ` Benjamin Riefenstahl
@ 2018-06-02 14:24     ` Eli Zaretskii
  2021-08-12 13:51       ` Lars Ingebrigtsen
  0 siblings, 1 reply; 6+ messages in thread
From: Eli Zaretskii @ 2018-06-02 14:24 UTC (permalink / raw)
  To: Benjamin Riefenstahl; +Cc: 31679

> From: Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net>
> Cc: 31679@debbugs.gnu.org
> Date: Sat, 02 Jun 2018 15:55:49 +0200
> 
> > First, you should lose the trailing null (or add one more), since
> > UTF-16 strings must, by definition, have an even number of bytes.
> 
> Actually this string *has* 8 bytes, the last '\0' completes the 'l' to
> form the two-byte character.

Oops.  I guess I modified the string while playing with the example
and ended up with one more null.

> > Why? because it is perfectly valid for a plain-ASCII string to include
> > null bytes, so Emacs prefers to guess ASCII.
> 
> While NUL is a valid ASCII character according to the standard,
> practically nobody uses it as a character.  So for a heuristic in this
> context, it would be a bad decision to treat it just as another
> character.

That's because you _know_ this is supposed to be human-readable text,
made of non-null characters.  But Emacs doesn't.

> And indeed NUL bytes are treated as a strong indication of binary data,
> it seems.  I tried to debug this.  The C routine detect_coding_utf_16
> tries to distinguish between binary and UTF-16, but it is not called for
> the string above.  That routine is called OTOH, when I add a non-ASCII
> character as in "h\0t\0m\0l\0ü\0", but even then it decides that the
> string is not UTF-16 (?).

Don't forget that decoding is supposed to be fast, because it's
something Emacs does each time it visits a file or accepts input from
a subprocess.  So it tries not to go through all the possible
encodings, but instead bails out as soon as it thinks it has found a
good guess.

> > Moral: detecting an encoding in Emacs is based on heuristic
> > _guesswork_, which is heavily biased to what is deemed to be the most
> > frequent use cases.  And UTF-16 is quite infrequent, at least on Posix
> > hosts.
> >
> > IOW, detecting encoding in Emacs is not as reliable as you seem to
> > expect.  If you _know_ the text is in UTF-16, just tell Emacs to use
> > that, don't let it guess.
> 
> My use-case is that I am trying to paste types other than UTF8_STRING
> from the X11 clipboard, and have them handled as automatically as
> possible.  While official clipboard types probably have a documented
> encoding (and I have code for those), applications like Firefox also put
> private formats there.  And Firefox seems to like UTF-16, even the
> text/html format it puts there is UTF-16.

If you have a special application in mind, you could always write some
simple enough code in Lisp to see if UTF-16 should be tried, then tell
Emacs to try that explicitly.
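
One way such Lisp code might look is sketched below. It is only an
illustration under several assumptions: the function names are
hypothetical, the 64-byte sample size is arbitrary, only little-endian
UTF-16 without a BOM is considered, and the strict test accepts only
text whose characters are all below U+0100 (every second byte must be
NUL).

```elisp
;; Hypothetical helpers, not part of Emacs.
(defun my-looks-like-utf-16le-p (bytes &optional limit)
  "Return non-nil if BYTES starts with (non-NUL, NUL) byte pairs only.
BYTES is a unibyte string.  Only the first LIMIT bytes are examined;
LIMIT defaults to 64."
  (let* ((n (min (length bytes) (or limit 64)))
         (n (- n (% n 2)))              ; look at whole byte pairs only
         (i 0)
         (ok (> n 0)))                  ; an empty sample never matches
    (while (and ok (< i n))
      (setq ok (and (/= (aref bytes i) 0)
                    (= (aref bytes (1+ i)) 0)))
      (setq i (+ i 2)))
    ok))

(defun my-decode-maybe-utf-16le (bytes)
  "Decode BYTES as UTF-16LE if it looks like it, else let Emacs guess."
  (if (my-looks-like-utf-16le-p bytes)
      (decode-coding-string bytes 'utf-16le)
    (decode-coding-string bytes 'undecided)))

(my-decode-maybe-utf-16le "h\0t\0m\0l\0")
;; => "html"
```

The check runs before any decoding, so the normal detection path stays
untouched for data that does not match the pattern.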

> I have tried to debug the C routines that implement this (s.a.), but the
> code is somewhat hairy.  I guess I'll have another look to see if I can
> understand it better.

We could add code to detect_coding_system that looks at some short
enough prefix of the text and sees whether there's a null byte there
for each non-null byte, and try UTF-16 if so.  Assuming that we want
to improve the chances of having UTF-16 detected for a small penalty,
that is.

Thanks.






* bug#31679: 26.1; detect-coding-string does not detect UTF-16
  2018-06-02 14:24     ` Eli Zaretskii
@ 2021-08-12 13:51       ` Lars Ingebrigtsen
  2021-09-09 15:23         ` Lars Ingebrigtsen
  0 siblings, 1 reply; 6+ messages in thread
From: Lars Ingebrigtsen @ 2021-08-12 13:51 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 31679, Benjamin Riefenstahl

Eli Zaretskii <eliz@gnu.org> writes:

>> My use-case is that I am trying to paste types other than UTF8_STRING
>> from the X11 clipboard, and have them handled as automatically as
>> possible.  While official clipboard types probably have a documented
>> encoding (and I have code for those), applications like Firefox also put
>> private formats there.  And Firefox seems to like UTF-16, even the
>> text/html format it puts there is UTF-16.
>
> If you have a special application in mind, you could always write some
> simple enough code in Lisp to see if UTF-16 should be tried, then tell
> Emacs to try that explicitly.

I ran into the same issue when dealing with X selections -- but there
are even more peculiarities in that area (some selections add a spurious
nul to the end, and some don't), so you have to write a bit of code
around this: `decode-coding-string' in itself can't be expected to deal
with or guess all these oddities (as you say).

>> I have tried to debug the C routines that implement this (s.a.), but the
>> code is somewhat hairy.  I guess I'll have another look to see if I can
>> understand it better.
>
> We could add code to detect_coding_system that looks at some short
> enough prefix of the text and sees whether there's a null byte there
> for each non-null byte, and try UTF-16 if so.  Assuming that we want
> to improve the chances of having UTF-16 detected for a small penalty,
> that is.

I do think that, in general, it would be nice if detect_coding_system
did try a bit harder to guess at utf-16.  For instance, if (in the first
X bytes of the string) more than 90% of the byte pairs look like
non-nul/nul pairs, then it's pretty likely to be utf-16.  (And I think
that would be easy enough to implement?)
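
To make the idea concrete, here is a rough sketch of that ratio test
written in Lisp rather than in detect_coding_system itself.  The
function names are hypothetical, the 256-byte sample size stands in for
the unspecified "X bytes", and only the little-endian byte order
(non-nul byte first) is checked:

```elisp
;; Hypothetical helpers, not part of Emacs.
(defun my-utf-16le-pair-ratio (bytes &optional limit)
  "Return the fraction of byte pairs in BYTES that are (non-NUL, NUL).
BYTES is a unibyte string.  Only the first LIMIT bytes are examined;
LIMIT defaults to 256."
  (let* ((n (min (length bytes) (or limit 256)))
         (pairs (/ n 2))                ; number of whole byte pairs
         (hits 0))
    (dotimes (k pairs)
      (when (and (/= (aref bytes (* 2 k)) 0)
                 (= (aref bytes (1+ (* 2 k))) 0))
        (setq hits (1+ hits))))
    (if (zerop pairs)
        0.0
      (/ (float hits) pairs))))

(defun my-probably-utf-16le-p (bytes)
  "Return non-nil if more than 90% of the sampled pairs look like UTF-16LE."
  (> (my-utf-16le-pair-ratio bytes) 0.9))

(my-utf-16le-pair-ratio "h\0t\0m\0l\0")  ; => 1.0
(my-utf-16le-pair-ratio "html")          ; => 0.0
```

Unlike a strict all-pairs test, the threshold tolerates a few characters
outside the Latin-1 range, at the cost of one pass over the sample.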

On the other hand, as you point out, there's a performance penalty that
may not be worth it.

So...  uhm...  does anybody have an opinion here?  Try harder for utf-16
or just leave it as it is?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no


* bug#31679: 26.1; detect-coding-string does not detect UTF-16
  2021-08-12 13:51       ` Lars Ingebrigtsen
@ 2021-09-09 15:23         ` Lars Ingebrigtsen
  0 siblings, 0 replies; 6+ messages in thread
From: Lars Ingebrigtsen @ 2021-09-09 15:23 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 31679, Benjamin Riefenstahl

Lars Ingebrigtsen <larsi@gnus.org> writes:

> On the other hand, as you point out, there's a performance penalty that
> may not be worth it.
>
> So...  uhm...  does anybody have an opinion here?  Try harder for utf-16
> or just leave it as it is?

Nobody had an opinion in a month, so I'm closing this bug report.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no





