From: Benjamin Riefenstahl
To: Eli Zaretskii
Cc: 31679@debbugs.gnu.org
Newsgroups: gmane.emacs.bugs
Subject: bug#31679: 26.1; detect-coding-string does not detect UTF-16
Date: Sat, 02 Jun 2018 15:55:49 +0200
Message-ID: <874lilgumy.fsf@blei.turtle-trading.net>
In-Reply-To: <83zi0deish.fsf@gnu.org> (Eli Zaretskii's message of "Sat, 02 Jun 2018 10:42:22 +0300")
References: <87efhq47nz.fsf@justinian.i-did-not-set--mail-host-address--so-tickle-me> <83zi0deish.fsf@gnu.org>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux)

Hi Eli,

>> From: Benjamin Riefenstahl
>> (detect-coding-string "h\0t\0m\0l\0")
>>
>> And I was surprised that this does not detect UTF-16 but instead gives
>> (no-conversion).

Eli Zaretskii writes:
> First, you should lose the trailing null (or add one more), since
> UTF-16 strings must, by definition, have an even number of bytes.

Actually this string *has* 8 bytes; the last '\0' completes the 'l' to
form the two-byte character.

> Next, you should disable null byte detection by binding
> inhibit-null-byte-detection to a non-nil value, because otherwise
> Emacs's guesswork will prefer no-conversion, assuming this is binary
> data.

OK, that is a good tip.

> Why? because it is perfectly valid for a plain-ASCII string to include
> null bytes, so Emacs prefers to guess ASCII.

While NUL is a valid ASCII character according to the standard,
practically nobody uses it as a character.  So for a heuristic in this
context, it would be a bad decision to treat it as just another
character.  And indeed, NUL bytes are treated as a strong indication of
binary data, it seems.

I tried to debug this.  The C routine detect_coding_utf_16 tries to
distinguish between binary and UTF-16, but it is not called for the
string above.  That routine is called, OTOH, when I add a non-ASCII
character, as in "h\0t\0m\0l\0ü\0", but even then it decides that the
string is not UTF-16 (?).

> Morale: detecting an encoding in Emacs is based on heuristic
> _guesswork_, which is heavily biased to what is deemed to be the most
> frequent use cases.  And UTF-16 is quite infrequent, at least on Posix
> hosts.
>
> IOW, detecting encoding in Emacs is not as reliable as you seem to
> expect.  If you _know_ the text is in UTF-16, just tell Emacs to use
> that, don't let it guess.
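
For reference, the two approaches discussed above can be sketched as
follows: decoding the bytes explicitly as UTF-16, and letting Emacs
guess with NUL-byte detection disabled.  This is only a minimal sketch.
The helper name my/utf16le-bytes-to-string is hypothetical, and the
choice of utf-16le (little-endian, no byte-order mark) is an assumption
based on the byte order of the example string.

```elisp
;; Sketch only.  `my/utf16le-bytes-to-string' is a hypothetical helper
;; name, and `utf-16le' is an assumption about the data's byte order.
(defun my/utf16le-bytes-to-string (bytes)
  "Decode the unibyte string BYTES as little-endian UTF-16 without a BOM."
  (decode-coding-string bytes 'utf-16le))

;; The literal holds 8 bytes: four ASCII letters, each followed by NUL.
(length "h\0t\0m\0l\0")                        ; 8

;; Telling Emacs the encoding instead of letting it guess.
(my/utf16le-bytes-to-string "h\0t\0m\0l\0")    ; "html"

;; Letting Emacs guess with NUL-byte detection disabled.  The result
;; depends on the detection heuristics, so no expected value is shown.
(let ((inhibit-null-byte-detection t))
  (detect-coding-string "h\0t\0m\0l\0"))
```

The explicit decode does not depend on the detection heuristics at all,
which is why it is the more predictable option whenever the encoding of
the data is known in advance.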

My use-case is that I am trying to paste types other than UTF8_STRING
from the X11 clipboard, and have them handled as automatically as
possible.  While official clipboard types probably have a documented
encoding (and I have code for those), applications like Firefox also
put private formats there.  And Firefox seems to like UTF-16; even the
text/html format it puts there is UTF-16.

I have tried to debug the C routines that implement this (see above),
but the code is somewhat hairy.  I guess I'll have another look to see
if I can understand it better.

Thanks so far,
benny