* bug#31679: 26.1; detect-coding-string does not detect UTF-16
From: Benjamin Riefenstahl @ 2018-06-01 19:40 UTC
To: 31679
I have been trying this (in real life the strings are often longer, of
course):
(detect-coding-string "h\0t\0m\0l\0")
And I was surprised that this does not detect UTF-16 but instead gives
(no-conversion).
The result of (coding-system-priority-list) is
(utf-8 iso-2022-7bit iso-latin-1 iso-2022-7bit-lock iso-2022-8bit-ss2
emacs-mule raw-text iso-2022-jp in-is13194-devanagari
chinese-iso-8bit utf-8-auto utf-8-with-signature utf-16
utf-16be-with-signature utf-16le-with-signature utf-16be utf-16le
japanese-shift-jis chinese-big5 undecided)
Does this just not work, or am I doing something wrong?
Thanks,
benny
Recent messages:
For information about GNU Emacs and the GNU system, type C-h C-a.
next-line: End of buffer
(no-conversion)
Quit [2 times]
Type C-x 1 to delete the help window.
Mark set
delete-backward-char: Text is read-only
Configured features:
XPM JPEG TIFF GIF PNG RSVG IMAGEMAGICK SOUND GPM GSETTINGS NOTIFY
LIBSELINUX GNUTLS LIBXML2 FREETYPE M17N_FLT LIBOTF XFT ZLIB
TOOLKIT_SCROLL_BARS GTK2 X11 THREADS LCMS2
Important settings:
value of $LANG: en_US.UTF-8
locale-coding-system: utf-8-unix
Major mode: Lisp Interaction
Minor modes in effect:
tooltip-mode: t
global-eldoc-mode: t
eldoc-mode: t
electric-indent-mode: t
mouse-wheel-mode: t
tool-bar-mode: t
menu-bar-mode: t
file-name-shadow-mode: t
global-font-lock-mode: t
font-lock-mode: t
auto-composition-mode: t
auto-encryption-mode: t
auto-compression-mode: t
line-number-mode: t
transient-mark-mode: t
Load-path shadows:
None found.
Features:
(shadow sort mail-extr emacsbug message rmc puny seq byte-opt gv
bytecomp byte-compile cconv dired dired-loaddefs format-spec rfc822 mml
mml-sec password-cache epa derived epg epg-config gnus-util rmail
rmail-loaddefs mm-decode mm-bodies mm-encode mail-parse rfc2231
mailabbrev gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums
mm-util mail-prsvr mail-utils cl-extra help-fns radix-tree help-mode
easymenu cl-loaddefs cl-lib term/xterm xterm time-date elec-pair
mule-util tooltip eldoc electric uniquify ediff-hook vc-hooks
lisp-float-type mwheel term/x-win x-win term/common-win x-dnd tool-bar
dnd fontset image regexp-opt fringe tabulated-list replace newcomment
text-mode elisp-mode lisp-mode prog-mode register page menu-bar
rfn-eshadow isearch timer select scroll-bar mouse jit-lock font-lock
syntax facemenu font-core term/tty-colors frame cl-generic cham georgian
utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao korean
japanese eucjp-ms cp51932 hebrew greek romanian slovak czech european
ethiopic indian cyrillic chinese composite charscript charprop
case-table epa-hook jka-cmpr-hook help simple abbrev obarray minibuffer
cl-preloaded nadvice loaddefs button faces cus-face macroexp files
text-properties overlay sha1 md5 base64 format env code-pages mule
custom widget hashtable-print-readable backquote inotify lcms2
dynamic-setting system-font-setting font-render-setting move-toolbar gtk
x-toolkit x multi-tty make-network-process emacs)
Memory information:
((conses 8 102532 5281)
(symbols 24 20919 1)
(miscs 20 38 212)
(strings 16 29808 1314)
(string-bytes 1 767826)
(vectors 12 12354)
(vector-slots 4 470678 7618)
(floats 8 56 559)
(intervals 28 260 1)
(buffers 536 12)
(heap 1024 30861 580))
* bug#31679: 26.1; detect-coding-string does not detect UTF-16
From: Eli Zaretskii @ 2018-06-02 7:42 UTC
To: Benjamin Riefenstahl; +Cc: 31679
> From: Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net>
> Date: Fri, 01 Jun 2018 21:40:32 +0200
>
> I have been trying this (in real life the strings are often longer, of
> course):
>
> (detect-coding-string "h\0t\0m\0l\0")
>
> And I was surprised that this does not detect UTF-16 but instead gives
> (no-conversion).
First, you should lose the trailing null (or add one more), since
UTF-16 strings must, by definition, have an even number of bytes.
Next, you should disable null byte detection by binding
inhibit-null-byte-detection to a non-nil value, because otherwise
Emacs's guesswork will prefer no-conversion, assuming this is binary
data.
If you do that, you get
  (let ((inhibit-null-byte-detection t))
    (detect-coding-string "h\0t\0m\0l"))
    => (undecided)
Why? Because it is perfectly valid for a plain-ASCII string to include
null bytes, so Emacs prefers to guess ASCII.
As another example, try this:
  (prefer-coding-system 'utf-16)
  (let ((inhibit-null-byte-detection t))
    (detect-coding-string (encode-coding-string "áçðë" 'utf-16-be) t))
    => utf-16
but
  (let ((inhibit-null-byte-detection t))
    (detect-coding-string
     (substring (encode-coding-string "áçðë" 'utf-16-be) 2) t))
    => iso-latin-1
So even when UTF-16 is the most preferred encoding, just removing the
BOM is enough to let Emacs prefer something other than UTF-16.
Moral: detecting an encoding in Emacs is based on heuristic
_guesswork_, which is heavily biased to what is deemed to be the most
frequent use cases. And UTF-16 is quite infrequent, at least on Posix
hosts.
IOW, detecting encoding in Emacs is not as reliable as you seem to
expect. If you _know_ the text is in UTF-16, just tell Emacs to use
that; don't let it guess.
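A minimal sketch of that advice, assuming the bytes are known to be
little-endian UTF-16 without a BOM (as in the string from the report).
The helper name my-decode-utf-16le is hypothetical; it only wraps the
standard decode-coding-string with the utf-16le coding system.

```elisp
(defun my-decode-utf-16le (bytes)
  "Decode unibyte string BYTES as little-endian UTF-16 without a BOM.
Hypothetical helper: no detection is attempted, the coding system is
simply stated by the caller."
  (decode-coding-string bytes 'utf-16le))

;; The 8 bytes from the report, built explicitly as a unibyte string.
(my-decode-utf-16le (unibyte-string ?h 0 ?t 0 ?m 0 ?l 0))
;; => "html"
```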
* bug#31679: 26.1; detect-coding-string does not detect UTF-16
From: Benjamin Riefenstahl @ 2018-06-02 13:55 UTC
To: Eli Zaretskii; +Cc: 31679
Hi Eli,
>> From: Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net>
>> (detect-coding-string "h\0t\0m\0l\0")
>>
>> And I was surprised that this does not detect UTF-16 but instead gives
>> (no-conversion).
Eli Zaretskii writes:
> First, you should lose the trailing null (or add one more), since
> UTF-16 strings must, by definition, have an even number of bytes.
Actually this string *has* 8 bytes, the last '\0' completes the 'l' to
form the two-byte character.
> Next, you should disable null byte detection by binding
> inhibit-null-byte-detection to a non-nil value, because otherwise
> Emacs's guesswork will prefer no-conversion, assuming this is binary
> data.
OK, that is a good tip.
> Why? Because it is perfectly valid for a plain-ASCII string to include
> null bytes, so Emacs prefers to guess ASCII.
While NUL is a valid ASCII character according to the standard,
practically nobody uses it as a character. So for a heuristic in this
context, it would be a bad decision to treat it just as another
character.
And indeed NUL bytes are treated as a strong indication of binary data,
it seems. I tried to debug this. The C routine detect_coding_utf_16
tries to distinguish between binary and UTF-16, but it is not called for
the string above. OTOH, that routine is called when I add a non-ASCII
character, as in "h\0t\0m\0l\0ü\0", but even then it decides that the
string is not UTF-16 (?).
> Moral: detecting an encoding in Emacs is based on heuristic
> _guesswork_, which is heavily biased to what is deemed to be the most
> frequent use cases. And UTF-16 is quite infrequent, at least on Posix
> hosts.
>
> IOW, detecting encoding in Emacs is not as reliable as you seem to
> expect. If you _know_ the text is in UTF-16, just tell Emacs to use
> that, don't let it guess.
My use-case is that I am trying to paste types other than UTF8_STRING
from the X11 clipboard, and have them handled as automatically as
possible. While official clipboard types probably have a documented
encoding (and I have code for those), applications like Firefox also put
private formats there. And Firefox seems to like UTF-16, even the
text/html format it puts there is UTF-16.
I have tried to debug the C routines that implement this (see above),
but the code is somewhat hairy. I guess I'll have another look to see if
I can understand it better.
Thanks so far,
benny
* bug#31679: 26.1; detect-coding-string does not detect UTF-16
From: Eli Zaretskii @ 2018-06-02 14:24 UTC
To: Benjamin Riefenstahl; +Cc: 31679
> From: Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net>
> Cc: 31679@debbugs.gnu.org
> Date: Sat, 02 Jun 2018 15:55:49 +0200
>
> > First, you should lose the trailing null (or add one more), since
> > UTF-16 strings must, by definition, have an even number of bytes.
>
> Actually this string *has* 8 bytes, the last '\0' completes the 'l' to
> form the two-byte character.
Oops. I guess I modified the string while playing with the example
and ended up with one more null.
> > Why? Because it is perfectly valid for a plain-ASCII string to include
> > null bytes, so Emacs prefers to guess ASCII.
>
> While NUL is a valid ASCII character according to the standard,
> practically nobody uses it as a character. So for a heuristic in this
> context, it would be a bad decision to treat it just as another
> character.
That's because you _know_ this is supposed to be human-readable text,
made of non-null characters. But Emacs doesn't.
> And indeed NUL bytes are treated as a strong indication of binary data,
> it seems. I tried to debug this. The C routine detect_coding_utf_16
> tries to distinguish between binary and UTF-16, but it is not called for
> the string above. That routine is called OTOH, when I add a non-ASCII
> character as in "h\0t\0m\0l\0ü\0", but even then it decides that the
> string is not UTF-16 (?).
Don't forget that decoding is supposed to be fast, because it's
something Emacs does each time it visits a file or accepts input from
a subprocess. So it tries not to go through all the possible
encodings, but instead bails out as soon as it thinks it has found a
good guess.
> > Moral: detecting an encoding in Emacs is based on heuristic
> > _guesswork_, which is heavily biased to what is deemed to be the most
> > frequent use cases. And UTF-16 is quite infrequent, at least on Posix
> > hosts.
> >
> > IOW, detecting encoding in Emacs is not as reliable as you seem to
> > expect. If you _know_ the text is in UTF-16, just tell Emacs to use
> > that, don't let it guess.
>
> My use-case is that I am trying to paste types other than UTF8_STRING
> from the X11 clipboard, and have them handled as automatically as
> possible. While official clipboard types probably have a documented
> encoding (and I have code for those), applications like Firefox also put
> private formats there. And Firefox seems to like UTF-16, even the
> text/html format it puts there is UTF-16.
If you have a special application in mind, you could always write some
simple enough code in Lisp to see if UTF-16 should be tried, then tell
Emacs to try that explicitly.
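One way such Lisp code might look, as a rough sketch only. It assumes
the input is a unibyte string and that "looks like UTF-16" means
UTF-16LE-encoded ASCII, i.e. every sampled byte pair is non-nul
followed by nul. The names my-utf-16le-ascii-prefix-p and
my-decode-maybe-utf-16le, and the default limit of 64 bytes, are
made up for illustration.

```elisp
(defun my-utf-16le-ascii-prefix-p (bytes &optional limit)
  "Return non-nil if BYTES looks like UTF-16LE-encoded ASCII text.
Check up to LIMIT bytes (default 64): each even-indexed byte must be
non-nul and each odd-indexed byte must be nul."
  (let* ((n (min (length bytes) (or limit 64)))
         (n (- n (% n 2)))              ; only look at whole byte pairs
         (i 0)
         (ok (> n 0)))                  ; no complete pair => not UTF-16
    (while (and ok (< i n))
      (setq ok (and (/= (aref bytes i) 0)
                    (= (aref bytes (1+ i)) 0))
            i (+ i 2)))
    ok))

(defun my-decode-maybe-utf-16le (bytes)
  "Decode BYTES as UTF-16LE if it looks like it, else let Emacs guess."
  (if (my-utf-16le-ascii-prefix-p bytes)
      (decode-coding-string bytes 'utf-16le)
    (decode-coding-string bytes 'undecided)))

(my-decode-maybe-utf-16le "h\0t\0m\0l\0")
;; => "html"
```

The strict "every pair" test rejects text with any non-ASCII character
in the sampled prefix; that is a deliberate simplification here.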
> I have tried to debug the C routines that implement this (s.a.), but the
> code is somewhat hairy. I guess I'll have another look to see if I can
> understand it better.
We could add code to detect_coding_system that looks at some short
enough prefix of the text and sees whether there's a null byte there
for each non-null byte, and try UTF-16 if so. Assuming that we want
to improve the chances of having UTF-16 detected for a small penalty,
that is.
Thanks.
* bug#31679: 26.1; detect-coding-string does not detect UTF-16
From: Lars Ingebrigtsen @ 2021-08-12 13:51 UTC
To: Eli Zaretskii; +Cc: 31679, Benjamin Riefenstahl
Eli Zaretskii <eliz@gnu.org> writes:
>> My use-case is that I am trying to paste types other than UTF8_STRING
>> from the X11 clipboard, and have them handled as automatically as
>> possible. While official clipboard types probably have a documented
>> encoding (and I have code for those), applications like Firefox also put
>> private formats there. And Firefox seems to like UTF-16, even the
>> text/html format it puts there is UTF-16.
>
> If you have a special application in mind, you could always write some
> simple enough code in Lisp to see if UTF-16 should be tried, then tell
> Emacs to try that explicitly.
I ran into the same issue when dealing with X selections -- but there
are even more peculiarities in that area (some selections add a spurious
nul to the end, and some don't), so you have to write a bit of code
around this: `decode-coding-string' in itself can't be expected to deal
with or guess all these oddities (as you say).
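As a hedged illustration of the kind of wrapper code meant here: the
sketch below assumes the spurious nul shows up as a single extra byte
that makes the length odd, which may not match what every application
actually does. The function name my-drop-odd-trailing-nul is
hypothetical.

```elisp
(defun my-drop-odd-trailing-nul (bytes)
  "Return BYTES without its last byte if that byte is a lone trailing nul.
A nul is only dropped when BYTES has an odd length, so the final nul of
a complete UTF-16LE byte pair is left alone."
  (let ((n (length bytes)))
    (if (and (> n 0)
             (= (% n 2) 1)
             (= (aref bytes (1- n)) 0))
        (substring bytes 0 (1- n))
      bytes)))

;; Five bytes: "ht" in UTF-16LE plus one stray nul.
(decode-coding-string (my-drop-odd-trailing-nul "h\0t\0\0") 'utf-16le)
;; => "ht"
```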
>> I have tried to debug the C routines that implement this (s.a.), but the
>> code is somewhat hairy. I guess I'll have another look to see if I can
>> understand it better.
>
> We could add code to detect_coding_system that looks at some short
> enough prefix of the text and sees whether there's a null byte there
> for each non-null byte, and try UTF-16 if so. Assuming that we want
> to improve the chances of having UTF-16 detected for a small penalty,
> that is.
I do think that, in general, it would be nice if detect_coding_system
did try a bit harder to guess at utf-16. For instance, if (in the first
X bytes of the string) more than 90% of the byte pairs look like
non-nul/nul pairs, then it's pretty likely to be utf-16. (And I think
that would be easy enough to implement?)
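To make the idea concrete, here is a Lisp-level sketch of that
heuristic; it is not the C code in detect_coding_system, and nothing
like it was committed in this thread. The function names and the
256-byte sample size are assumptions, the 90% threshold is the figure
suggested above, and only the little-endian byte order is checked.

```elisp
(defun my-utf-16le-pair-ratio (bytes &optional limit)
  "Return the fraction of byte pairs in BYTES that are non-nul/nul.
Only the first LIMIT bytes (default 256) are examined.  Return 0.0
when BYTES holds no complete pair."
  (let* ((n (min (length bytes) (or limit 256)))
         (pairs (/ n 2))
         (hits 0))
    (dotimes (k pairs)
      (when (and (/= (aref bytes (* 2 k)) 0)
                 (= (aref bytes (1+ (* 2 k))) 0))
        (setq hits (1+ hits))))
    (if (zerop pairs) 0.0 (/ (float hits) pairs))))

(defun my-probably-utf-16le-p (bytes)
  "Return non-nil if more than 90% of the sampled pairs look like UTF-16LE."
  (> (my-utf-16le-pair-ratio bytes) 0.9))

;; The string from earlier in the thread, including the u-umlaut (#xfc).
(my-probably-utf-16le-p (unibyte-string ?h 0 ?t 0 ?m 0 ?l 0 #xfc 0))
;; => t
```

Unlike a strict all-pairs check, the ratio tolerates the occasional
character outside Latin-1 in otherwise mostly-Latin text.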
On the other hand, as you point out, there's a performance penalty that
may not be worth it.
So... uhm... does anybody have an opinion here? Try harder for utf-16
or just leave it as it is?
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#31679: 26.1; detect-coding-string does not detect UTF-16
From: Lars Ingebrigtsen @ 2021-09-09 15:23 UTC
To: Eli Zaretskii; +Cc: 31679, Benjamin Riefenstahl
Lars Ingebrigtsen <larsi@gnus.org> writes:
> On the other hand, as you point out, there's a performance penalty that
> may not be worth it.
>
> So... uhm... does anybody have an opinion here? Try harder for utf-16
> or just leave it as it is?
Nobody had an opinion in a month, so I'm closing this bug report.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no