From: Eli Zaretskii
To: Benjamin Riefenstahl
Cc: 31679@debbugs.gnu.org
Subject: bug#31679: 26.1; detect-coding-string does not detect UTF-16
Date: Sat, 02 Jun 2018 10:42:22 +0300
Message-ID: <83zi0deish.fsf@gnu.org>
References: <87efhq47nz.fsf@justinian.i-did-not-set--mail-host-address--so-tickle-me>

> From: Benjamin Riefenstahl
> Date: Fri, 01 Jun 2018 21:40:32 +0200
>
> I have been trying this (in real life the strings are often longer, of
> course):
>
>   (detect-coding-string "h\0t\0m\0l\0")
>
> And I was surprised that this does not detect UTF-16 but instead gives
> (no-conversion).

First, you should lose the trailing null (or add one more), since
UTF-16 strings must, by definition, have an even number of bytes.

Next, you should disable null byte detection by binding
inhibit-null-byte-detection to a non-nil value, because otherwise
Emacs's guesswork will prefer no-conversion, assuming this is binary
data.

If you do that, you get

  (let ((inhibit-null-byte-detection t))
    (detect-coding-string "h\0t\0m\0l"))
    => (undecided)

Why?  Because it is perfectly valid for a plain-ASCII string to
include null bytes, so Emacs prefers to guess ASCII.

As another example, try this:

  (prefer-coding-system 'utf-16)
  (let ((inhibit-null-byte-detection t))
    (detect-coding-string
     (encode-coding-string "áçðë" 'utf-16-be) t))
    => utf-16

but

  (let ((inhibit-null-byte-detection t))
    (detect-coding-string
     (substring (encode-coding-string "áçðë" 'utf-16-be) 2) t))
    => iso-latin-1

So even when UTF-16 is the most preferred encoding, just removing the
BOM is enough to let Emacs prefer something other than UTF-16.

Moral: detecting an encoding in Emacs is based on heuristic
_guesswork_, which is heavily biased to what is deemed to be the most
frequent use cases.  And UTF-16 is quite infrequent, at least on POSIX
hosts.

IOW, detecting encoding in Emacs is not as reliable as you seem to
expect.
If you _know_ the text is in UTF-16, just tell Emacs to use that, don't let it guess.
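
A minimal sketch of that advice, assuming the bytes are little-endian
UTF-16 without a BOM, as in the string from the report.  The helper
name my-decode-utf-16le is hypothetical and only there for
illustration; decode-coding-string and the utf-16le coding system are
the standard Emacs ones.

```elisp
;; `my-decode-utf-16le' is a hypothetical helper, not an Emacs function.
;; It names the coding system explicitly, so no detection takes place.
(defun my-decode-utf-16le (bytes)
  "Decode BYTES as little-endian UTF-16 that carries no BOM."
  (decode-coding-string bytes 'utf-16le))

;; The string from the original report, all 8 bytes of it.
(my-decode-utf-16le "h\0t\0m\0l\0")
;; => "html"
```

When reading a file rather than decoding a string, binding
coding-system-for-read around the read should have the same effect of
bypassing the guesswork.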