From: Werner LEMBERG
Newsgroups: gmane.emacs.devel
Subject: Re: coding tags and utf-16
Date: Wed, 04 Jan 2006 15:58:21 +0100 (CET)
Message-ID: <20060104.155821.10305768.wl@gnu.org>
References: <20051221.090033.182620434.wl@gnu.org>
To: handa@m17n.org
Cc: groff@gnu.org, bruno@clisp.org, emacs-devel@gnu.org

> > There is a serious problem with coding tags and utf-16 encodings
> > of any flavour: Emacs simply can't recognize the tag.  This is a
> > non-trivial problem.
>
> Sorry for the late reply, but I think coding tag is useless for a
> file encoded in some of utf-16 variants.
>
> If a file has BOM at the head, BOM should tell the exact encoding
> whatever is specified in coding tag.
>
> If a file is encoded without BOM, we must use the less reliable
> heuristics to guess utf-16be or utf-16le.  If you find a coding-tag
> spec by ignoring all zero bytes at even byte indexes, it means that
> the file is, in high possibility, utf-16be whatever the tag value
> is.  If you find a coding-tag spec by ignoring all zero bytes at odd
> byte indexes, it means that the file is utf-16le whatever the tag
> value is.
>
> So, in any cases, a tag value itself is useless.

[...]

I'll do the following for groff's preprocessor, preconv:

  . If the data starts with a BOM, use it, and ignore the coding tag.

  . Otherwise, if there are zero bytes in the first two lines, ignore
    those zero values, emit a warning, and use the coding tag, if any.

  . Otherwise, use the default encoding -- this normally will lead to
    a wrong result and make groff explode, but I consider this better
    than to apply heuristics, especially if you have to recognize both
    UTF16 and UTF32 variants.

This is probably a suboptimal solution but quite easy to implement,
and the user can always explicitly select an encoding on the command
line.  Perhaps someone finds (and implements) a better way which I can
then adapt to preconv.


    Werner
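
The three-step fallback listed above for preconv could be sketched roughly as follows. This is only an illustration in Python, not preconv's actual code: the function name `guess_encoding`, the `latin-1` default, and the simple matching of an Emacs-style `-*- coding: ... -*-` tag are all assumptions made for the example. It also assumes that a coding tag found in data without zero bytes is used as-is.

```python
import codecs
import re
import sys

# Longer BOMs are tested first: the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
]


def guess_encoding(data, default="latin-1"):
    """Hypothetical helper; `default` is a placeholder value."""
    # Step 1: a BOM wins, and any coding tag is ignored.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name

    # Step 2: look at the first two lines only.  If they contain
    # zero bytes, drop them and warn, then search for a coding tag.
    head = b"\n".join(data.split(b"\n", 2)[:2])
    if b"\x00" in head:
        print("warning: zero bytes in the first two lines ignored",
              file=sys.stderr)
        head = head.replace(b"\x00", b"")

    block = re.search(rb"-\*-(.*?)-\*-", head)
    if block:
        tag = re.search(rb"coding:\s*([^\s;]+)", block.group(1))
        if tag:
            return tag.group(1).decode("ascii", "replace")

    # Step 3: no BOM and no tag, so fall back to the default.
    return default
```

Note that the tag value is returned unchanged even when zero bytes had to be stripped; no attempt is made to tell utf-16be from utf-16le by the position of the zeros, which matches the intent of avoiding heuristics.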