From: David Kastrup
Newsgroups: gmane.emacs.devel,gmane.emacs.pretest.bugs
Subject: Re: 23.0.60; [nxml] BOM and utf-8
Date: Mon, 19 May 2008 22:57:20 +0200
Message-ID: <85ej7yqafj.fsf@lola.goethe.zz>
In-Reply-To: <874p8uf2xm.fsf@uwakimon.sk.tsukuba.ac.jp> (Stephen J. Turnbull's message of "Tue, 20 May 2008 05:34:45 +0900")
To: "Stephen J. Turnbull"
Cc: emacs-pretest-bug@gnu.org, Patrick Drechsler, Miles Bader

"Stephen J. Turnbull" writes:

> David Kastrup writes:
> > "Stephen J. Turnbull" writes:
> >
> > > In any case, maintaining faithfulness of representation is simply not
> > > possible, as you point out
> >
> > With some coding systems.  But the latin-* and utf-* can maintain
> > the binary stream since their coding is required to be canonical in
> > the standard.
>
> latin-* will do so because of their extremely limited range.  It's
> unfortunate that programmer intuitions about text have been
> Americanized (== drastically limited) by these encodings.
>
> utf-* can maintain representation in the very limited sense you have
> in mind, and I know that is very useful to you in dealing with
> non-conforming applications like TeX.  However, you still run into the
> problem that faithfulness of representation is not a goal of Unicode.

I am not interested in the "goal of Unicode" but in that of Emacs.
Unicode is about text files.  But Emacs communicates via byte streams
and those are not necessarily text, or necessarily all text.

> > > It's also not at all obvious that that is a very useful
> > > requirement when dealing with a character-oriented standard like
> > > Unicode or XML, since you can expect many applications to
> > > canonicalize the text "behind your back".
> >
> > That's not an issue.
>
> What do you mean by "that's not an issue?"  How can you know when I
> haven't named the application?

Because we are not talking about what arbitrary applications may do,
but what Emacs should do.  There may be other applications that tend to
garble byte streams, and there might even be some Elisp applications
that garble byte streams.  But that does not mean that the Emacs core
should feel nonchalant about garbling byte streams.

> > Also you can load, edit and save a text file in collaborative
> > environments, and the diffs/patches will be just in the edited
> > areas (this will supposedly work better with Emacs-23 than
> > Emacs-22).  Those are quite important features.
>
> Sure, and Emacs must provide coding systems that preserve them, and
> generally use those coding systems by default.  Did anybody say
> otherwise?

So what was your point supposed to be?

> > > Users should get used to it, and we should document how to force
> > > Emacs to error rather than do anything behind your back for those
> > > who need binary faithfulness rather than text faithfulness.
> >
> > Since binary faithfulness implies text faithfulness, there is no
> > reason not to do the right thing instead of erroring out.
>
> "There is no reason"?  How arrogant of you!  Rather, "David Kastrup
> lacks the knowledge of the reasons."  Here are three examples:
>
> Binary faithfulness may imply breaking text programs.  For example,
> `forward-char' and `replace-string' will give surprising results in a
> buffer using Unicode internally that contains Unicode in NFD
> normalization (and these anomalies will be noticeable in all Western
> European languages excluding English).
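
As a small illustration of the NFD point quoted above, here is a minimal sketch in Python rather than Emacs Lisp. The names `nfc`, `nfd`, and `resave_as_nfc` are made up for this example, and Python's `unicodedata` module only stands in for whatever Emacs does internally. It shows that the same visible word takes a different number of code points in each form, that a literal search for the precomposed character misses the decomposed one, and that normalizing before saving changes the byte stream.

```python
import unicodedata

# The same visible word in two Unicode normalization forms.
nfc = unicodedata.normalize("NFC", "caf\u00e9")  # e-acute as one code point
nfd = unicodedata.normalize("NFD", "caf\u00e9")  # "e" followed by U+0301


def resave_as_nfc(raw: bytes) -> bytes:
    """Hypothetical 'canonicalizing' save: decode, normalize to NFC, encode."""
    return unicodedata.normalize("NFC", raw.decode("utf-8")).encode("utf-8")


original = nfd.encode("utf-8")

# Code point counts differ, so moving "one character" is ambiguous.
print(len(nfc), len(nfd))

# A literal search for the precomposed character misses the NFD form.
print("\u00e9" in nfc, "\u00e9" in nfd)

# Normalizing on save alters the bytes of a file that was stored as NFD.
print(resave_as_nfc(original) == original)
```

The last check is the byte-stream concern in miniature: a file that arrived in NFD does not survive a load-and-save cycle unchanged if the editor normalizes it along the way.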

So forward-char and replace-string should be made to work as expected
on non-normalized texts.  One could even normalize texts and use text
properties in order to restore the non-normalized form when
communicating externally.

> Binary faithfulness may imply inefficiency.  For example, files need
> not be normalized, which would imply keeping a copy of the whole file
> and doing a Unicode diff to determine which parts of the file need to
> be saved from the buffer and which parts from the saved copy.

That sounds more like "binary faithfulness may inspire stupidity".  Of
course one needs to look for reasonable implementations.  Inefficiency
has not kept us from moving the Emacs-20.1 MULE model (where buffer and
string offsets were byte-oriented) to the 20.7 model (no idea where the
transition happened exactly) with character-based buffer and string
offsets.  Sometimes one has to balance sanity and efficiency.  And
there are ways for getting a reasonable amount of efficiency back.

> Binary faithfulness may be incompatible with other user demands, for
> example if a user introduces Latin-2 characters into a Latin-9 text.

Why do you think we switched to utf-8 internally and got rid of latin
unification?

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum