From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Kenichi Handa <handa@m17n.org>
Newsgroups: gmane.emacs.devel
Subject: Re: coding tags and utf-16
Date: Tue, 28 Feb 2006 10:08:36 +0900
Message-ID: <E1FDtLw-0005XV-00@etlken>
References: <20051221.090033.182620434.wl@gnu.org>	<E1Eu2Ln-00032e-00@etlken>	<m1psn61xim.fsf-monnier+emacs@gnu.org>	<E1Eul7v-00056E-00@etlken>
	<85vewxodk2.fsf@lola.goethe.zz> <dse2i6$d2b$1@sea.gmane.org>
NNTP-Posting-Host: main.gmane.org
Mime-Version: 1.0 (generated by SEMI 1.14.3 - "Ushinoya")
Content-Type: text/plain; charset=US-ASCII
X-Trace: sea.gmane.org 1141419112 1959 80.91.229.2 (3 Mar 2006 20:51:52 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Fri, 3 Mar 2006 20:51:52 +0000 (UTC)
Cc: emacs-devel@gnu.org
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri Mar 03 21:51:48 2006
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by ciao.gmane.org with esmtp (Exim 4.43)
	id 1FFHFX-0003DM-TG
	for ged-emacs-devel@m.gmane.org; Fri, 03 Mar 2006 21:51:46 +0100
Original-Received: from localhost ([127.0.0.1] helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1FFHFY-000730-21
	for ged-emacs-devel@m.gmane.org; Fri, 03 Mar 2006 15:51:44 -0500
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1FEqF3-00026P-LL
	for emacs-devel@gnu.org; Thu, 02 Mar 2006 11:01:26 -0500
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1FEqDv-000201-2l
	for emacs-devel@gnu.org; Thu, 02 Mar 2006 11:01:21 -0500
Original-Received: from [199.232.76.173] (helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1FEAll-0006ba-Qj
	for emacs-devel@gnu.org; Tue, 28 Feb 2006 14:44:26 -0500
Original-Received: from [192.47.44.130] (helo=tsukuba.m17n.org)
	by monty-python.gnu.org with esmtps
	(TLS-1.0:DHE_RSA_AES_256_CBC_SHA:32) (Exim 4.52) id 1FDtMv-0000lm-4o
	for emacs-devel@gnu.org; Mon, 27 Feb 2006 20:09:37 -0500
Original-Received: from nfs.m17n.org (nfs.m17n.org [192.47.44.7])
	by tsukuba.m17n.org (8.13.4/8.13.4/Debian-3) with ESMTP id
	k1S18bQR019723; Tue, 28 Feb 2006 10:08:37 +0900
Original-Received: from etlken (etlken.m17n.org [192.47.44.125])
	by nfs.m17n.org (8.13.4/8.13.4/Debian-3) with ESMTP id k1S18bCF002446; 
	Tue, 28 Feb 2006 10:08:37 +0900
Original-Received: from handa by etlken with local (Exim 3.36 #1 (Debian))
	id 1FDtLw-0005XV-00; Tue, 28 Feb 2006 10:08:36 +0900
Original-To: Kevin Rodgers <ihs_4664@yahoo.com>
In-reply-to: <dse2i6$d2b$1@sea.gmane.org> (message from Kevin Rodgers on Wed, 
	08 Feb 2006 17:32:02 -0700)
User-Agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2
	Emacs/22.0.50 (i686-pc-linux-gnu) MULE/5.0 (SAKAKI)
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:51089
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/51089>

Sorry for the late response.

In article <dse2i6$d2b$1@sea.gmane.org>, Kevin Rodgers <ihs_4664@yahoo.com> writes:

>> I thought we had discussed this already.  The BOM-encodings should
>> have priority since the likelihood of a misdetection is negligible
>> (the character pair does not make sense at the start of a text in
>> latin-1 in any language): the only thing that can reasonably be
>> expected to happen is that a binary file is detected as utf-16.  Not
>> much of an issue, I'd say.

I've just dug out the old mails we exchanged on this topic
(about a year ago).  As far as I understand, there was no
clear conclusion.  Here are the extracts:
------------------------------------------------------------
I wrote:
> I think BOM is not that safe because there are many charsets
> who have normal letters at 0xFE and 0xFF.

Jason wrote:
> But what are those characters, and are they likely to appear as a pair 
> at the beginning of the file, and nowhere else?

I wrote:
> Sorry, I don't know.

Dave wrote:
>> Exactly what Windows does for what?  Recognizing a utf-16 registry
>> file when opened in the registry editor?

> Auto-detecting utf-16 generally.  Although I don't think it would give
> false positives on iso-8859 text, I don't know if it could with other
> charsets.
> 
> I could believe that Windows doesn't just go by byte-order-mark in
> some locales where there might be a problem.  If so, it could be
> useful to do the same thing.
------------------------------------------------------------

For instance, I've just googled for the two-character sequence
that the bytes 0xFE 0xFF represent in koi8 and found several
occurrences.
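
To make the concern concrete, here is a minimal Python sketch (not
Emacs code) that decodes the two bytes under two 8-bit charsets.  The
helper name `letters_for' and the choice of KOI8-R and Latin-1 are
assumptions made only for illustration.

```python
import unicodedata

# The two bytes that a big-endian UTF-16 BOM consists of.
BOM_BE = b"\xfe\xff"


def letters_for(charset: str) -> list:
    """Return the Unicode names of the characters that BOM_BE
    decodes to when interpreted in the given 8-bit charset."""
    return [unicodedata.name(ch) for ch in BOM_BE.decode(charset)]


# Both charsets map 0xFE and 0xFF to ordinary letters.
for charset in ("koi8_r", "latin_1"):
    print(charset, letters_for(charset))
```

In both charsets the pair is ordinary text, so the open question is
only how often it occurs at the very beginning of a file.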

> Exactly.  So why haven't these entries been added to 
> auto-coding-regexp-alist?

> ("\\`\xEF\xBB\xBF" . utf-8)

As far as I know, UTF-8 should not start with this sequence
unless the text really starts with ZWNBSP (very unlikely).
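
The relation between that byte sequence and ZWNBSP can be checked with
a small Python sketch (again not Emacs code; the function name
`starts_with_utf8_signature' is hypothetical):

```python
# U+FEFF is ZERO WIDTH NO-BREAK SPACE (ZWNBSP); the same code point
# is what serves as the byte order mark.
ZWNBSP = "\ufeff"

# Encoding U+FEFF in UTF-8 yields the three bytes from the regexp.
UTF8_SIGNATURE = ZWNBSP.encode("utf-8")


def starts_with_utf8_signature(data: bytes) -> bool:
    """True if DATA begins with the UTF-8 encoding of U+FEFF."""
    return data.startswith(UTF8_SIGNATURE)


sample = UTF8_SIGNATURE + b"hello"
print(starts_with_utf8_signature(sample))
# A plain UTF-8 decode keeps U+FEFF as the first character.
print(sample.decode("utf-8")[0] == ZWNBSP)
```

So a plain UTF-8 decoder cannot tell a signature from a text that
really begins with ZWNBSP; treating it as a signature is a policy
choice.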

> ("\\`\xFE\xFF" . utf-16-be)
> ("\\`\xFF\xFE" . utf-16-le)

Although it's not clear how safe they are, if no one objects,
I'll add them to auto-coding-regexp-alist.

> ("\\`\x00\x00\xFE\xFF" . utf-32-be)
> ("\\`\xFF\xFE\x00\x00" . utf-32-le)

Emacs doesn't support those encodings for the moment.
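
For reference, the prefix matching that the quoted entries describe
can be sketched in Python as below.  This is not how Emacs implements
auto-coding-regexp-alist; `BOM_TABLE' and `guess_by_bom' are
hypothetical names, and the labels are just the names used in the
quoted entries.  The ordering is an assumption of the sketch: the
four-byte patterns are tried first because FF FE 00 00 also begins
with FF FE.

```python
from typing import Optional

# (prefix, label) pairs, longest prefixes first so that
# FF FE 00 00 is not mistaken for the two-byte pattern FF FE.
BOM_TABLE = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]


def guess_by_bom(data: bytes) -> Optional[str]:
    """Return the label of the first prefix that DATA starts with,
    or None if no prefix matches."""
    for prefix, label in BOM_TABLE:
        if data.startswith(prefix):
            return label
    return None


print(guess_by_bom(b"\xff\xfeh\x00i\x00"))   # matches the FF FE entry
print(guess_by_bom(b"plain ascii"))          # no prefix matches
# A KOI8-R text whose first two bytes are 0xFE 0xFF is also matched,
# which is the misdetection discussed in this thread.
print(guess_by_bom(b"\xfe\xff" + "text".encode("koi8_r")))
```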

---
Kenichi Handa
handa@m17n.org