From mboxrd@z Thu Jan 1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Kevin Rodgers
Newsgroups: gmane.emacs.devel
Subject: Re: coding tags and utf-16
Date: Wed, 08 Feb 2006 17:32:02 -0700
References: <20051221.090033.182620434.wl@gnu.org> <85vewxodk2.fsf@lola.goethe.zz>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Original-To: emacs-devel@gnu.org
User-Agent: Mozilla Thunderbird 0.9 (X11/20041105)
In-Reply-To: <85vewxodk2.fsf@lola.goethe.zz>
List-Id: "Emacs development discussions."
Xref: news.gmane.org gmane.emacs.devel:50229

David Kastrup wrote:
> Kenichi Handa writes:
>
>>In article , Stefan Monnier writes:
>>
>>>> So, in any cases, a tag value itself is useless. Then how
>>>> to detect utf-16 more reliably? In the current Emacs
>>>> (i.e. Ver.22), I think we can use auto-coding-regexp-alist
>>>> or auto-coding-alist. In the former case, we can register
>>>> BOM patterns and also something like "\\`\\(\0[\0-\177]\\)+"
>>>> for utf-16be. In the latter case, you can use more
>>>> complicated heuristics in a registered function.
>>
>>>Can't it be somehow added to detect_coding_utf_16?
>>
>>Yes, but usually it has no effect if, for instance,
>>iso-8859-1 is more preferred. If only ASCII and Latin-1
>>characters are encoded in utf-16, all bytes (including BOM)
>>are valid for iso-8859-1.
>
> I thought we had discussed this already.
> The BOM-encodings should
> have priority since the likelihood of a misdetection is negligible
> (the character pair does not make sense at the start of a text in
> latin-1 in any language): the only thing that can reasonably be
> expected to happen is that a binary file is detected as utf-16. Not
> much of an issue, I'd say.

Exactly. So why haven't these entries been added to
auto-coding-regexp-alist?

("\\`\xEF\xBB\xBF" . utf-8)
("\\`\xFE\xFF" . utf-16-be)
("\\`\xFF\xFE" . utf-16-le)
("\\`\x00\x00\xFE\xFF" . utf-32-be)
("\\`\xFF\xFE\x00\x00" . utf-32-le)

> Of course, for the BOM-less utf-16 encodings, priority should depend
> on the language environment.

Definitely.

-- 
Kevin Rodgers
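
A note on the entries proposed above: as far as I can tell, a table like this is searched in order and the first match wins, so ordering matters. The UTF-16LE BOM (FF FE) is a prefix of the UTF-32LE BOM (FF FE 00 00), which suggests the UTF-32 entries should be tried first. Here is a minimal Python sketch of that first-match lookup. It is only an illustration of the idea, not Emacs code: `BOM_TABLE` and `sniff_bom` are hypothetical names, and the encoding labels are Python codec names rather than Emacs coding systems.

```python
import codecs

# Ordered (BOM bytes, encoding label) pairs. The first prefix that
# matches wins, so longer BOMs that share a prefix with shorter ones
# must come first. Labels are Python codec names (an assumption made
# for this sketch), not Emacs coding-system names.
BOM_TABLE = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00, must precede UTF-16LE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return the label of the first BOM that prefixes data, or None."""
    for bom, label in BOM_TABLE:
        if data.startswith(bom):
            return label
    return None


if __name__ == "__main__":
    # "h" encoded as UTF-16LE and as UTF-32LE, each with a BOM.
    print(sniff_bom(b"\xff\xfeh\x00"))
    print(sniff_bom(b"\xff\xfe\x00\x00h\x00\x00\x00"))
    # The BOM bytes are also valid Latin-1 (y-diaeresis, thorn), which
    # is why a preferred iso-8859-1 can shadow utf-16 detection.
    print(b"\xff\xfe".decode("latin-1"))
```

One caveat with this ordering: a UTF-16LE text whose first character after the BOM is U+0000 would be reported as UTF-32LE. That seems an acceptable trade-off, for the same reason given above about binary files.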