From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Kevin Rodgers <ihs_4664@yahoo.com>
Newsgroups: gmane.emacs.devel
Subject: Re: coding tags and utf-16
Date: Wed, 08 Feb 2006 17:32:02 -0700
Message-ID: <dse2i6$d2b$1@sea.gmane.org>
References: <20051221.090033.182620434.wl@gnu.org>
	<E1Eu2Ln-00032e-00@etlken>	<m1psn61xim.fsf-monnier+emacs@gnu.org>
	<E1Eul7v-00056E-00@etlken> <85vewxodk2.fsf@lola.goethe.zz>
NNTP-Posting-Host: main.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Trace: sea.gmane.org 1139449442 25083 80.91.229.2 (9 Feb 2006 01:44:02 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Thu, 9 Feb 2006 01:44:02 +0000 (UTC)
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Thu Feb 09 02:44:01 2006
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by ciao.gmane.org with esmtp (Exim 4.43)
	id 1F70qf-0003V0-CH
	for ged-emacs-devel@m.gmane.org; Thu, 09 Feb 2006 02:43:54 +0100
Original-Received: from localhost ([127.0.0.1] helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1F70qe-0001Nj-ST
	for ged-emacs-devel@m.gmane.org; Wed, 08 Feb 2006 20:43:52 -0500
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1F6zjj-0006Kl-VN
	for emacs-devel@gnu.org; Wed, 08 Feb 2006 19:32:41 -0500
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1F6zjf-0006K1-Bp
	for emacs-devel@gnu.org; Wed, 08 Feb 2006 19:32:35 -0500
Original-Received: from [199.232.76.173] (helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1F6zjZ-0006Ij-9N
	for emacs-devel@gnu.org; Wed, 08 Feb 2006 19:32:30 -0500
Original-Received: from [80.91.229.2] (helo=ciao.gmane.org)
	by monty-python.gnu.org with esmtps (TLS-1.0:RSA_AES_256_CBC_SHA:32)
	(Exim 4.52) id 1F6zmz-00083S-Ee
	for emacs-devel@gnu.org; Wed, 08 Feb 2006 19:36:01 -0500
Original-Received: from list by ciao.gmane.org with local (Exim 4.43)
	id 1F6zjP-0007NL-Ek
	for emacs-devel@gnu.org; Thu, 09 Feb 2006 01:32:19 +0100
Original-Received: from 207.167.42.60 ([207.167.42.60])
	by main.gmane.org with esmtp (Gmexim 0.1 (Debian))
	id 1AlnuQ-0007hv-00
	for <emacs-devel@gnu.org>; Thu, 09 Feb 2006 01:32:19 +0100
Original-Received: from ihs_4664 by 207.167.42.60 with local (Gmexim 0.1 (Debian))
	id 1AlnuQ-0007hv-00
	for <emacs-devel@gnu.org>; Thu, 09 Feb 2006 01:32:19 +0100
X-Injected-Via-Gmane: http://gmane.org/
Original-To: emacs-devel@gnu.org
Original-Lines: 42
Original-X-Complaints-To: usenet@sea.gmane.org
X-Gmane-NNTP-Posting-Host: 207.167.42.60
User-Agent: Mozilla Thunderbird 0.9 (X11/20041105)
X-Accept-Language: en-us, en
In-Reply-To: <85vewxodk2.fsf@lola.goethe.zz>
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:50229
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/50229>

David Kastrup wrote:
> Kenichi Handa <handa@m17n.org> writes: 
> 
>>In article <m1psn61xim.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:
>>
>>>> So, in any case, a tag value itself is useless.  Then how
>>>> to detect utf-16 more reliably?  In the current Emacs
>>>> (i.e. Ver.22), I think we can use auto-coding-regexp-alist
>>>> or auto-coding-alist.  In the former case, we can register
>>>> BOM patterns and also something like "\\`\\(\0[\0-\177]\\)+"
>>>> for utf-16be.  In the latter case, you can use more
>>>> complicated heuristics in a registered function.
>>
>>>Can't it be somehow added to detect_coding_utf_16?
>>
>>Yes, but usually it has no effect if, for instance,
>>iso-8859-1 is more preferred.  If only ASCII and Latin-1
>>characters are encoded in utf-16, all bytes (including BOM)
>>are valid for iso-8859-1.
> 
> I thought we had discussed this already.  The BOM-encodings should
> have priority since the likelihood of a misdetection is negligible
> (the character pair does not make sense at the start of a text in
> latin-1 in any language): the only thing that can reasonably be
> expected to happen is that a binary file is detected as utf-16.  Not
> much of an issue, I'd say.

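Exactly.  So why haven't these entries been added to
auto-coding-regexp-alist?

("\\`\xEF\xBB\xBF" . utf-8)
("\\`\x00\x00\xFE\xFF" . utf-32-be)
("\\`\xFF\xFE\x00\x00" . utf-32-le)
("\\`\xFE\xFF" . utf-16-be)
("\\`\xFF\xFE" . utf-16-le)

(The utf-32-le entry has to come before utf-16-le: the alist is
searched in order, and FF FE 00 00 also starts with the UTF-16LE BOM.)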
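
To see which entry would win for a given byte sequence, something like
the following could be evaluated.  It is only a rough sketch: the helper
name my-auto-coding-for-bytes is made up for illustration, and it
approximates the real lookup (which searches the file's contents in a
buffer) by matching the regexps against a unibyte string instead.

```elisp
;; Hypothetical helper, not part of Emacs.  It walks
;; `auto-coding-regexp-alist' in order and returns the coding system
;; of the first entry whose regexp matches BYTES.
(defun my-auto-coding-for-bytes (bytes)
  "Return the coding system `auto-coding-regexp-alist' gives for BYTES.
BYTES is a unibyte string holding the first bytes of a file.
Return nil if no entry matches."
  (let ((case-fold-search nil))
    ;; `assoc-default' calls the test with (REGEXP BYTES) for each
    ;; entry and stops at the first non-nil result.
    (assoc-default bytes auto-coding-regexp-alist #'string-match-p)))

;; Example calls (results depend on what the alist contains):
;; (my-auto-coding-for-bytes "\xFE\xFF\x00\x41")
;; (my-auto-coding-for-bytes "\xFF\xFE\x00\x00")
```

Because the first match wins, the second call returns whichever of the
FF FE patterns comes first in the alist, so the order of the entries
matters.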

> Of course, for the BOM-less utf-16 encodings, priority should depend
> on the language environment.

Definitely.
-- 
Kevin Rodgers