From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#31679: 26.1; detect-coding-string does not detect UTF-16
Date: Sat, 02 Jun 2018 17:24:19 +0300
Message-ID: <836031e06k.fsf@gnu.org>
References: <87efhq47nz.fsf@justinian.i-did-not-set--mail-host-address--so-tickle-me> <83zi0deish.fsf@gnu.org> <874lilgumy.fsf@blei.turtle-trading.net>
In-reply-to: <874lilgumy.fsf@blei.turtle-trading.net> (message from Benjamin Riefenstahl on Sat, 02 Jun 2018 15:55:49 +0200)
To: Benjamin Riefenstahl
Cc: 31679@debbugs.gnu.org

> From: Benjamin Riefenstahl
> Cc: 31679@debbugs.gnu.org
> Date: Sat, 02 Jun 2018 15:55:49 +0200
>
> > First, you should lose the trailing null (or add one more), since
> > UTF-16 strings must, by definition, have an even number of bytes.
>
> Actually this string *has* 8 bytes, the last '\0' completes the 'l' to
> form the two-byte character.

Oops.  I guess I modified the string while playing with the example
and ended up with one more null.

> > Why?  Because it is perfectly valid for a plain-ASCII string to include
> > null bytes, so Emacs prefers to guess ASCII.
>
> While NUL is a valid ASCII character according to the standard,
> practically nobody uses it as a character.  So for a heuristic in this
> context, it would be a bad decision to treat it just as another
> character.

That's because you _know_ this is supposed to be human-readable text,
made of non-null characters.  But Emacs doesn't.

> And indeed NUL bytes are treated as a strong indication of binary data,
> it seems.  I tried to debug this.  The C routine detect_coding_utf_16
> tries to distinguish between binary and UTF-16, but it is not called for
> the string above.  That routine is called OTOH, when I add a non-ASCII
> character as in "h\0t\0m\0l\0ü\0", but even then it decides that the
> string is not UTF-16 (?).

Don't forget that decoding is supposed to be fast, because it's
something Emacs does each time it visits a file or accepts input from
a subprocess.
So it tries not to go through all the possible encodings, but instead
bails out as soon as it thinks it has found a good guess.

> > Moral: detecting an encoding in Emacs is based on heuristic
> > _guesswork_, which is heavily biased to what is deemed to be the most
> > frequent use cases.  And UTF-16 is quite infrequent, at least on Posix
> > hosts.
> >
> > IOW, detecting encoding in Emacs is not as reliable as you seem to
> > expect.  If you _know_ the text is in UTF-16, just tell Emacs to use
> > that, don't let it guess.
>
> My use-case is that I am trying to paste types other than UTF8_STRING
> from the X11 clipboard, and have them handled as automatically as
> possible.  While official clipboard types probably have a documented
> encoding (and I have code for those), applications like Firefox also put
> private formats there.  And Firefox seems to like UTF-16, even the
> text/html format it puts there is UTF-16.

If you have a special application in mind, you could always write
some simple enough code in Lisp to see if UTF-16 should be tried, then
tell Emacs to try that explicitly.

> I have tried to debug the C routines that implement this (s.a.), but the
> code is somewhat hairy.  I guess I'll have another look to see if I can
> understand it better.

We could add code to detect_coding_system that looks at some short
enough prefix of the text, sees whether there's a null byte there for
each non-null byte, and tries UTF-16 if so.  Assuming that we want to
improve the chances of having UTF-16 detected for a small penalty,
that is.

Thanks.
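
For illustration only, the prefix check described above could look
roughly like the following standalone C sketch.  It is not taken from
the Emacs sources: the function name guess_utf16, the UTF16_GUESS_*
constants and the prefix_len parameter are all hypothetical, and it
assumes the strictest reading of the idea, namely that every 2-byte
unit in the prefix must pair one null byte with one non-null byte.

```c
#include <stddef.h>

/* Hypothetical result codes; these names do not exist in Emacs.  */
enum { UTF16_GUESS_NONE = 0, UTF16_GUESS_LE = 1, UTF16_GUESS_BE = 2 };

/* Look at no more than PREFIX_LEN bytes of BUF (which holds LEN bytes)
   and guess whether they are UTF-16 text whose characters all fit in
   one non-null byte.  Nulls at odd offsets suggest little-endian,
   nulls at even offsets suggest big-endian.  */
int
guess_utf16 (const unsigned char *buf, size_t len, size_t prefix_len)
{
  size_t n = len < prefix_len ? len : prefix_len;
  size_t i, pairs, le_pairs = 0, be_pairs = 0;

  n &= ~(size_t) 1;		/* Consider only whole 2-byte units.  */
  if (n < 4)
    return UTF16_GUESS_NONE;	/* Too short to say anything.  */

  pairs = n / 2;
  for (i = 0; i < n; i += 2)
    {
      if (buf[i] != 0 && buf[i + 1] == 0)
	le_pairs++;		/* Non-null byte, then null byte.  */
      else if (buf[i] == 0 && buf[i + 1] != 0)
	be_pairs++;		/* Null byte, then non-null byte.  */
    }

  if (le_pairs == pairs)
    return UTF16_GUESS_LE;
  if (be_pairs == pairs)
    return UTF16_GUESS_BE;
  return UTF16_GUESS_NONE;
}
```

A caller would run this on a short prefix only, so the cost stays
small, and would fall back to the normal detection whenever the result
is UTF16_GUESS_NONE.  The same test is easy to write in Lisp, after
which the text can be decoded explicitly with decode-coding-string and
utf-16le or utf-16be.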