* bug#31679: 26.1; detect-coding-string does not detect UTF-16
From: Benjamin Riefenstahl @ 2018-06-01 19:40 UTC
To: 31679

I have been trying this (in real life the strings are often longer, of
course):

    (detect-coding-string "h\0t\0m\0l\0")

And I was surprised that this does not detect UTF-16 but instead gives
(no-conversion).

The result of (coding-system-priority-list) is

    (utf-8 iso-2022-7bit iso-latin-1 iso-2022-7bit-lock
     iso-2022-8bit-ss2 emacs-mule raw-text iso-2022-jp
     in-is13194-devanagari chinese-iso-8bit utf-8-auto
     utf-8-with-signature utf-16 utf-16be-with-signature
     utf-16le-with-signature utf-16be utf-16le japanese-shift-jis
     chinese-big5 undecided)

Does this just not work, or am I doing something wrong?

Thanks,
benny

Recent messages:
For information about GNU Emacs and the GNU system, type C-h C-a.
next-line: End of buffer
(no-conversion)
Quit [2 times]
Type C-x 1 to delete the help window.
Mark set
delete-backward-char: Text is read-only

Configured features:
XPM JPEG TIFF GIF PNG RSVG IMAGEMAGICK SOUND GPM GSETTINGS NOTIFY
LIBSELINUX GNUTLS LIBXML2 FREETYPE M17N_FLT LIBOTF XFT ZLIB
TOOLKIT_SCROLL_BARS GTK2 X11 THREADS LCMS2

Important settings:
  value of $LANG: en_US.UTF-8
  locale-coding-system: utf-8-unix

Major mode: Lisp Interaction

Minor modes in effect:
  tooltip-mode: t
  global-eldoc-mode: t
  eldoc-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t
  transient-mark-mode: t

Load-path shadows:
None found.

Features:
(shadow sort mail-extr emacsbug message rmc puny seq byte-opt gv
bytecomp byte-compile cconv dired dired-loaddefs format-spec rfc822 mml
mml-sec password-cache epa derived epg epg-config gnus-util rmail
rmail-loaddefs mm-decode mm-bodies mm-encode mail-parse rfc2231
mailabbrev gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums
mm-util mail-prsvr mail-utils cl-extra help-fns radix-tree help-mode
easymenu cl-loaddefs cl-lib term/xterm xterm time-date elec-pair
mule-util tooltip eldoc electric uniquify ediff-hook vc-hooks
lisp-float-type mwheel term/x-win x-win term/common-win x-dnd tool-bar
dnd fontset image regexp-opt fringe tabulated-list replace newcomment
text-mode elisp-mode lisp-mode prog-mode register page menu-bar
rfn-eshadow isearch timer select scroll-bar mouse jit-lock font-lock
syntax facemenu font-core term/tty-colors frame cl-generic cham georgian
utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao korean
japanese eucjp-ms cp51932 hebrew greek romanian slovak czech european
ethiopic indian cyrillic chinese composite charscript charprop
case-table epa-hook jka-cmpr-hook help simple abbrev obarray minibuffer
cl-preloaded nadvice loaddefs button faces cus-face macroexp files
text-properties overlay sha1 md5 base64 format env code-pages mule
custom widget hashtable-print-readable backquote inotify lcms2
dynamic-setting system-font-setting font-render-setting move-toolbar gtk
x-toolkit x multi-tty make-network-process emacs)

Memory information:
((conses 8 102532 5281)
 (symbols 24 20919 1)
 (miscs 20 38 212)
 (strings 16 29808 1314)
 (string-bytes 1 767826)
 (vectors 12 12354)
 (vector-slots 4 470678 7618)
 (floats 8 56 559)
 (intervals 28 260 1)
 (buffers 536 12)
 (heap 1024 30861 580))

* bug#31679: 26.1; detect-coding-string does not detect UTF-16
From: Eli Zaretskii @ 2018-06-02 7:42 UTC
To: Benjamin Riefenstahl; +Cc: 31679

> From: Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net>
> Date: Fri, 01 Jun 2018 21:40:32 +0200
>
> I have been trying this (in real life the strings are often longer, of
> course):
>
>     (detect-coding-string "h\0t\0m\0l\0")
>
> And I was surprised that this does not detect UTF-16 but instead gives
> (no-conversion).

First, you should lose the trailing null (or add one more), since
UTF-16 strings must, by definition, have an even number of bytes.

Next, you should disable null byte detection by binding
inhibit-null-byte-detection to a non-nil value, because otherwise
Emacs's guesswork will prefer no-conversion, assuming this is binary
data.

If you do that, you get

    (let ((inhibit-null-byte-detection t))
      (detect-coding-string "h\0t\0m\0l"))
      => (undecided)

Why?  Because it is perfectly valid for a plain-ASCII string to include
null bytes, so Emacs prefers to guess ASCII.

As another example, try this:

    (prefer-coding-system 'utf-16)
    (let ((inhibit-null-byte-detection t))
      (detect-coding-string
        (encode-coding-string "áçðë" 'utf-16-be) t))
      => utf-16

but

    (let ((inhibit-null-byte-detection t))
      (detect-coding-string
        (substring (encode-coding-string "áçðë" 'utf-16-be) 2) t))
      => iso-latin-1

So even when UTF-16 is the most preferred encoding, just removing the
BOM is enough to let Emacs prefer something other than UTF-16.

Moral: detecting an encoding in Emacs is based on heuristic
_guesswork_, which is heavily biased to what is deemed to be the most
frequent use cases.  And UTF-16 is quite infrequent, at least on Posix
hosts.

IOW, detecting encoding in Emacs is not as reliable as you seem to
expect.  If you _know_ the text is in UTF-16, just tell Emacs to use
that, don't let it guess.

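The advice to name the coding system rather than let Emacs guess can be
sketched as below.  This is only an illustration: the helper name
my-decode-utf-16le is hypothetical, and it assumes the bytes are
little-endian UTF-16 without a BOM, as in the string from the original
report.

```elisp
(defun my-decode-utf-16le (bytes)
  "Decode the unibyte string BYTES as UTF-16LE, without any guessing.
Hypothetical helper for illustration; not part of Emacs."
  ;; `utf-16le' is the little-endian variant without a signature (BOM).
  (decode-coding-string bytes 'utf-16le))

;; The 8 bytes from the original report: four 2-byte code units.
(my-decode-utf-16le "h\0t\0m\0l\0")
```

Evaluating the last form should yield the string "html".  No detection
step is involved, so neither the coding-system priority list nor
inhibit-null-byte-detection affects the result.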
* bug#31679: 26.1; detect-coding-string does not detect UTF-16
From: Benjamin Riefenstahl @ 2018-06-02 13:55 UTC
To: Eli Zaretskii; +Cc: 31679

Hi Eli,

>> From: Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net>
>>     (detect-coding-string "h\0t\0m\0l\0")
>>
>> And I was surprised that this does not detect UTF-16 but instead gives
>> (no-conversion).

Eli Zaretskii writes:
> First, you should lose the trailing null (or add one more), since
> UTF-16 strings must, by definition, have an even number of bytes.

Actually this string *has* 8 bytes, the last '\0' completes the 'l' to
form the two-byte character.

> Next, you should disable null byte detection by binding
> inhibit-null-byte-detection to a non-nil value, because otherwise
> Emacs's guesswork will prefer no-conversion, assuming this is binary
> data.

O.k. that is a good tip.

> Why?  Because it is perfectly valid for a plain-ASCII string to include
> null bytes, so Emacs prefers to guess ASCII.

While NUL is a valid ASCII character according to the standard,
practically nobody uses it as a character.  So for a heuristic in this
context, it would be a bad decision to treat it just as another
character.

And indeed NUL bytes are treated as a strong indication of binary data,
it seems.  I tried to debug this.  The C routine detect_coding_utf_16
tries to distinguish between binary and UTF-16, but it is not called for
the string above.  That routine is called OTOH, when I add a non-ASCII
character as in "h\0t\0m\0l\0ü\0", but even then it decides that the
string is not UTF-16 (?).

> Moral: detecting an encoding in Emacs is based on heuristic
> _guesswork_, which is heavily biased to what is deemed to be the most
> frequent use cases.  And UTF-16 is quite infrequent, at least on Posix
> hosts.
>
> IOW, detecting encoding in Emacs is not as reliable as you seem to
> expect.  If you _know_ the text is in UTF-16, just tell Emacs to use
> that, don't let it guess.

My use-case is that I am trying to paste types other than UTF8_STRING
from the X11 clipboard, and have them handled as automatically as
possible.  While official clipboard types probably have a documented
encoding (and I have code for those), applications like Firefox also put
private formats there.  And Firefox seems to like UTF-16, even the
text/html format it puts there is UTF-16.

I have tried to debug the C routines that implement this (s.a.), but the
code is somewhat hairy.  I guess I'll have another look to see if I can
understand it better.

Thanks so far,
benny

* bug#31679: 26.1; detect-coding-string does not detect UTF-16
From: Eli Zaretskii @ 2018-06-02 14:24 UTC
To: Benjamin Riefenstahl; +Cc: 31679

> From: Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net>
> Cc: 31679@debbugs.gnu.org
> Date: Sat, 02 Jun 2018 15:55:49 +0200
>
> > First, you should lose the trailing null (or add one more), since
> > UTF-16 strings must, by definition, have an even number of bytes.
>
> Actually this string *has* 8 bytes, the last '\0' completes the 'l' to
> form the two-byte character.

Oops.  I guess I modified the string while playing with the example
and ended up with one more null.

> > Why?  Because it is perfectly valid for a plain-ASCII string to include
> > null bytes, so Emacs prefers to guess ASCII.
>
> While NUL is a valid ASCII character according to the standard,
> practically nobody uses it as a character.  So for a heuristic in this
> context, it would be a bad decision to treat it just as another
> character.

That's because you _know_ this is supposed to be human-readable text,
made of non-null characters.  But Emacs doesn't.

> And indeed NUL bytes are treated as a strong indication of binary data,
> it seems.  I tried to debug this.  The C routine detect_coding_utf_16
> tries to distinguish between binary and UTF-16, but it is not called for
> the string above.  That routine is called OTOH, when I add a non-ASCII
> character as in "h\0t\0m\0l\0ü\0", but even then it decides that the
> string is not UTF-16 (?).

Don't forget that decoding is supposed to be fast, because it's
something Emacs does each time it visits a file or accepts input from
a subprocess.  So it tries not to go through all the possible
encodings, but instead bails out as soon as it thinks it has found a
good guess.

> > Moral: detecting an encoding in Emacs is based on heuristic
> > _guesswork_, which is heavily biased to what is deemed to be the most
> > frequent use cases.  And UTF-16 is quite infrequent, at least on Posix
> > hosts.
> >
> > IOW, detecting encoding in Emacs is not as reliable as you seem to
> > expect.  If you _know_ the text is in UTF-16, just tell Emacs to use
> > that, don't let it guess.
>
> My use-case is that I am trying to paste types other than UTF8_STRING
> from the X11 clipboard, and have them handled as automatically as
> possible.  While official clipboard types probably have a documented
> encoding (and I have code for those), applications like Firefox also put
> private formats there.  And Firefox seems to like UTF-16, even the
> text/html format it puts there is UTF-16.

If you have a special application in mind, you could always write some
simple enough code in Lisp to see if UTF-16 should be tried, then tell
Emacs to try that explicitly.

> I have tried to debug the C routines that implement this (s.a.), but the
> code is somewhat hairy.  I guess I'll have another look to see if I can
> understand it better.

We could add code to detect_coding_system that looks at some short
enough prefix of the text and sees whether there's a null byte there
for each non-null byte, and try UTF-16 if so.  Assuming that we want
to improve the chances of having UTF-16 detected for a small penalty,
that is.

Thanks.

* bug#31679: 26.1; detect-coding-string does not detect UTF-16
From: Lars Ingebrigtsen @ 2021-08-12 13:51 UTC
To: Eli Zaretskii; +Cc: 31679, Benjamin Riefenstahl

Eli Zaretskii <eliz@gnu.org> writes:

>> My use-case is that I am trying to paste types other than UTF8_STRING
>> from the X11 clipboard, and have them handled as automatically as
>> possible.  While official clipboard types probably have a documented
>> encoding (and I have code for those), applications like Firefox also put
>> private formats there.  And Firefox seems to like UTF-16, even the
>> text/html format it puts there is UTF-16.
>
> If you have a special application in mind, you could always write some
> simple enough code in Lisp to see if UTF-16 should be tried, then tell
> Emacs to try that explicitly.

I ran into the same issue when dealing with X selections -- but there
are even more peculiarities in that area (some selections add a spurious
nul to the end, and some don't), so you have to write a bit of code
around this: `decode-coding-string' in itself can't be expected to deal
with or guess at all these oddities (as you say).

>> I have tried to debug the C routines that implement this (s.a.), but the
>> code is somewhat hairy.  I guess I'll have another look to see if I can
>> understand it better.
>
> We could add code to detect_coding_system that looks at some short
> enough prefix of the text and sees whether there's a null byte there
> for each non-null byte, and try UTF-16 if so.  Assuming that we want
> to improve the chances of having UTF-16 detected for a small penalty,
> that is.

I do think that, in general, it would be nice if detect_coding_system
did try a bit harder to guess at utf-16.  For instance, if (in the first
X bytes of the string) more than 90% of the byte pairs look like
non-nul/nul pairs, then it's pretty likely to be utf-16.  (And I think
that would be easy enough to implement?)

On the other hand, as you point out, there's a performance penalty that
may not be worth it.

So... uhm... does anybody have an opinion here?  Try harder for utf-16
or just leave it as it is?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

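The prefix heuristic discussed above (count non-nul/nul byte pairs in a
short prefix, then decode explicitly) could be written in Lisp roughly
as follows.  This is a sketch under stated assumptions, not what
detect_coding_system does: the function names, the 256-byte prefix and
the 0.9 default threshold are illustrative choices, and only
little-endian text is considered.

```elisp
(defun my-looks-like-utf-16le-p (bytes &optional limit threshold)
  "Return non-nil if the unibyte string BYTES looks like UTF-16LE text.
Inspect at most LIMIT bytes (default 256) and count the byte pairs of
the form non-NUL/NUL.  Return non-nil when more than THRESHOLD (default
0.9) of the inspected pairs match.  Hypothetical helper, not part of
Emacs."
  (let* ((limit (or limit 256))
         (threshold (or threshold 0.9))
         ;; Only complete pairs inside the prefix are examined.
         (npairs (/ (min (length bytes) limit) 2))
         (hits 0))
    (dotimes (i npairs)
      (when (and (/= (aref bytes (* 2 i)) 0)
                 (= (aref bytes (1+ (* 2 i))) 0))
        (setq hits (1+ hits))))
    (and (> npairs 0)
         (> (/ (float hits) npairs) threshold))))

(defun my-decode-maybe-utf-16le (bytes)
  "Decode BYTES as UTF-16LE if it looks like it, else let Emacs guess.
Hypothetical helper, not part of Emacs."
  (if (my-looks-like-utf-16le-p bytes)
      (decode-coding-string bytes 'utf-16le)
    (decode-coding-string bytes 'undecided)))

;; The string from the original report: 4 pairs, all non-NUL/NUL.
(my-decode-maybe-utf-16le "h\0t\0m\0l\0")
```

The test only favours text that is mostly ASCII or Latin-1, where the
high byte of each code unit is zero; a leading BOM counts as a
non-matching pair, and big-endian input would need the mirrored test
(NUL/non-NUL).  A spurious trailing nul, as seen in some X selections,
would still have to be trimmed by the caller before decoding.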
* bug#31679: 26.1; detect-coding-string does not detect UTF-16
From: Lars Ingebrigtsen @ 2021-09-09 15:23 UTC
To: Eli Zaretskii; +Cc: 31679, Benjamin Riefenstahl

Lars Ingebrigtsen <larsi@gnus.org> writes:

> On the other hand, as you point out, there's a performance penalty that
> may not be worth it.
>
> So... uhm... does anybody have an opinion here?  Try harder for utf-16
> or just leave it as it is?

Nobody had an opinion in a month, so I'm closing this bug report.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no