From: Kenichi Handa
Newsgroups: gmane.emacs.devel
Subject: Re: coding tags and utf-16
Date: Tue, 28 Feb 2006 10:08:36 +0900
References: <20051221.090033.182620434.wl@gnu.org> <85vewxodk2.fsf@lola.goethe.zz>
Original-To: Kevin Rodgers
Cc: emacs-devel@gnu.org
In-reply-to: (message from Kevin Rodgers on Wed, 08 Feb 2006 17:32:02 -0700)

Sorry for the late response.

Kevin Rodgers writes:

>> I thought we had discussed this already.  The BOM-encodings should
>> have priority since the likelihood of a misdetection is negligible
>> (the character pair does not make sense at the start of a text in
>> latin-1 in any language): the only thing that can reasonably be
>> expected to happen is that a binary file is detected as utf-16.  Not
>> much of an issue, I'd say.

I've just dug out the old mails we exchanged on this topic
(about a year ago).  To my understanding, there was no clear
conclusion.  Here are the extracts:

------------------------------------------------------------
I wrote:
> I think BOM is not that safe because there are many charsets
> who have normal letters at 0xFE and 0xFF.

Jason wrote:
> But what are those characters, and are they likely to appear as a pair
> at the beginning of the file, and nowhere else?

I wrote:
> Sorry, I don't know.

Dave wrote:
>> Exactly what Windows does for what?
>> Recognizing a utf-16 registry
>> file when opened in the registry editor?
> Auto-detecting utf-16 generally.  Although I don't think it would give
> false positives on iso-8859 text, I don't know if it could with other
> charsets.
>
> I could believe that Windows doesn't just go by byte-order-mark in
> some locales where there might be a problem.  If so, it could be
> useful to do the same thing.
------------------------------------------------------------

For instance, I've just googled the two-character sequence
0xFE 0xFF in koi8 and found several occurrences.

> Exactly.  So why haven't these entries been added to
> auto-coding-regexp-alist?

> ("\\`\xEF\xBB\xBF" . utf-8)

As far as I know, UTF-8 should not start with this sequence
unless the text really starts with ZWNBSP (very unlikely).

> ("\\`\xFE\xFF" . utf-16-be)
> ("\\`\xFF\xFE" . utf-16-le)

Although it's not clear how safe they are, if no one
objects, I'll add them to auto-coding-regexp-alist.

> ("\\`\x00\x00\xFE\xFF" . utf-32-be)
> ("\\`\xFF\xFE\x00\x00" . utf-32-le)

Emacs doesn't support those encodings for the moment.

---
Kenichi Handa
handa@m17n.org
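
To make the trade-off discussed above concrete, here is a rough sketch, in Python rather than Emacs Lisp, of BOM matching anchored at the start of the data in the way the quoted regexps are.  The names BOM_PATTERNS and sniff_bom are hypothetical and are not part of Emacs; the table simply mirrors the quoted entries.  It is only an illustration of the false-positive concern, not a description of how Emacs detects coding systems.

```python
import re

# Hypothetical table mirroring the regexps quoted in the message.
# The 4-byte UTF-32 signatures are tried first so that FF FE 00 00
# is not reported as UTF-16-LE (FF FE).
BOM_PATTERNS = [
    (re.compile(rb"\A\x00\x00\xFE\xFF"), "utf-32-be"),
    (re.compile(rb"\A\xFF\xFE\x00\x00"), "utf-32-le"),
    (re.compile(rb"\A\xEF\xBB\xBF"), "utf-8"),
    (re.compile(rb"\A\xFE\xFF"), "utf-16-be"),
    (re.compile(rb"\A\xFF\xFE"), "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return the encoding name suggested by a leading BOM, or None."""
    for pattern, name in BOM_PATTERNS:
        if pattern.match(data):
            return name
    return None


# In KOI8-R the bytes FE FF are two ordinary Cyrillic capital letters
# (U+0427 and U+042A), so a KOI8-R text that happens to start with
# that pair is indistinguishable from a UTF-16-BE signature here.
koi8_text = "\u0427\u042a".encode("koi8_r")
```

With this sketch, sniff_bom(koi8_text) reports "utf-16-be" even though the bytes are valid KOI8-R, which is the kind of misdetection in question.  How often such text actually begins a file is a separate matter that the sketch does not address.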