From: "Stephen J. Turnbull"
Newsgroups: gmane.emacs.devel
Subject: Re: utf-16le vs utf-16-le
Date: Tue, 15 Apr 2008 03:25:51 +0900
Message-ID: <87lk3gfg40.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <87wsn1fl72.fsf@uwakimon.sk.tsukuba.ac.jp> <87prssgacl.fsf@uwakimon.sk.tsukuba.ac.jp> <851w58q24a.fsf@lola.goethe.zz>
In-Reply-To: <851w58q24a.fsf@lola.goethe.zz>
To: David Kastrup
Cc: Eli Zaretskii, emacs-devel@gnu.org

David Kastrup writes:

> "Stephen J. Turnbull" writes:

> > I don't know, in fact I think I think [having BOM-specific coding
> > systems is] a bad idea. That's what the part of my message that
> > you snipped was saying. But I'll have to defer to Handa-san on
> > that.
>
> I think it obvious: if a BOM mark gets detected on read, one wants
> to have it removed from the buffer and reinserted on saving the
> buffer.

I agree, as you state it, it's obvious. My question is "why does that
need to be part of the coding system?" At present the UTF-16 and
UTF-32 Unicode coding systems (in the abstract) have *twenty-seven*
variants each (BOM-required, BOM-prohibited, BOM-autodetected X be,
le, system-dependent X CR, LF, CRLF), and UTF-8 needs *nine*. This is
nuts, from a user-education standpoint.

What I proposed was a more generic concept where use of signatures and
the EOL convention would (at least to the user) appear as buffer-local
variables.

> I am just not sure what the semantics for recoding/encoding/decoding
> regions are. They should not mess with BOM in any case, I would
> suppose.
> But then reading a file is not equivalent to reading it
> literally in unibyte mode and then decoding the buffer-region.

That's correct. The thing is, processing the BOM is a question of
*initialization* of a stream.

> Maybe there never was such an equivalence (can't be for shift codes, can
> it?).

In my view, there cannot be an equivalence. An Emacs buffer in unibyte
mode is a *different* stream from the file it was read from, and the
decision about BOM processing will have to be made differently from
the way the decision is made at the time of reading from the file.

You could add yet another option for BOM mode, namely "if this stream
is an Emacs buffer that is visiting a file in unibyte mode, then do BOM
processing on conversion as if you were reading in the file in
multibyte mode." I don't much like this....
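
For illustration only, the two points above (the variant count, and BOM
handling as stream initialization rather than as part of the codec) can
be sketched in Python. This is not Emacs code and not anybody's
proposed implementation; the names utf16_variants, utf8_variants,
read_stream, write_stream and decode_region are hypothetical, and the
big-endian fallback for unsigned input is an assumption made for the
sketch.

```python
import codecs
from itertools import product

# The three axes named in the message; the tuples are illustrative labels only.
BOM_POLICIES = ("required", "prohibited", "autodetected")
BYTE_ORDERS = ("be", "le", "system-dependent")
EOL_STYLES = ("CR", "LF", "CRLF")


def utf16_variants():
    """Every (BOM policy, byte order, EOL) combination: 3 * 3 * 3."""
    return list(product(BOM_POLICIES, BYTE_ORDERS, EOL_STYLES))


def utf8_variants():
    """UTF-8 has no byte-order axis, so only BOM policy and EOL: 3 * 3."""
    return list(product(BOM_POLICIES, EOL_STYLES))


def read_stream(raw):
    """Initialize a stream: look for a signature once, at the very start.

    Returns (text, had_bom, byte_order). The two flags play the role of
    per-buffer state kept outside the codec itself.
    """
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[2:].decode("utf-16-le"), True, "le"
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[2:].decode("utf-16-be"), True, "be"
    # Assumption for this sketch: unsigned input is treated as big-endian.
    return raw.decode("utf-16-be"), False, "be"


def write_stream(text, had_bom, byte_order):
    """Re-emit the signature on save only if one was seen on read."""
    bom = b""
    if had_bom:
        bom = codecs.BOM_UTF16_LE if byte_order == "le" else codecs.BOM_UTF16_BE
    return bom + text.encode("utf-16-" + byte_order)


def decode_region(raw, byte_order):
    """Decode a region of bytes with no BOM processing at all.

    A leading signature, if present, survives as the character U+FEFF.
    """
    return raw.decode("utf-16-" + byte_order)
```

With these definitions, reading a signed file and decoding the same
bytes as a region give different text: read_stream drops the signature
and remembers it for saving, while decode_region leaves U+FEFF in the
result. That is the non-equivalence discussed above, with the BOM and
byte order carried as two flags next to a single codec rather than as
separate coding systems.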