From: David Kastrup
Newsgroups: gmane.emacs.devel
Subject: Re: Inadequate documentation of silly characters on screen.
Date: Sat, 21 Nov 2009 13:33:23 +0100
Message-ID: <87lji0awh8.fsf@lola.goethe.zz>
To: emacs-devel@gnu.org

"Stephen J. Turnbull" writes:

> Stefan Monnier writes:
>
>  > I don't know what you mean.  The eight-bit "chars" were introduced
>  > to make sure that decoding+reencoding will always return the exact
>  > same byte-sequence, no matter what coding-system was used
>  > (i.e. even if the byte-sequence is invalid for that coding-system).
>  > Dunno how XEmacs handles it.
>
> Honestly, it currently doesn't, or doesn't very well, despite some
> work by Aidan.  But we don't need to make this a problem for _Emacs_.
> However, I think a well-behaved platform should by default error
> (something derived from invalid-state, in XEmacs's error hierarchy) in
> such a case; normally this means corruption in the file.

We take care that it does not mean corruption.  And more often it means
that you might have been loading with the wrong encoding (people do that
all the time).  If you edit some innocent ASCII part and save again, you
won't appreciate changes all across the file elsewhere in parts you did
not touch or see on-screen.

Sometimes there is no "right encoding".  If I load an executable or an
image file with tag strings and change one string in overwrite mode, I
want to be able to save again.  Compiled Elisp files contain binary
strings as well.  There may be source files with binary blobs in them,
there may be files with parts in different encodings and so on.

> There are special cases like utf8latex whose error messages give you a
> certain number of octets without respecting character boundaries; I
> agree there is need to handle this case.

Forget about the TeX problem: that is a red herring.  It is just one
case where irreversible corruption is not the right answer.  In fact, I
know of no case where irreversible corruption is the right answer.

"Don't touch what you don't understand" is a good rationale.  For
XEmacs, following this rationale would currently require erroring out.
And I actually recommend that you do so: you will learn the hard way
that users prefer the Emacs solution of "don't touch what you don't
understand", namely having artificial code points for losslessly
representing the parts Emacs does not understand in a particular
encoding.
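
To make the lossless round trip concrete, here is a minimal sketch in
Python rather than Emacs Lisp.  It uses the "surrogateescape" error
handler from PEP 383, which is only analogous to the Emacs scheme: Emacs
uses its own code points for raw bytes, and nothing below is Emacs code.
The sample bytes and variable names are made up for the illustration.

```python
# Sketch only: Python's PEP 383 handler, not the Emacs implementation.
# Hypothetical file content: Latin-1 bytes, so 0xE9 is invalid as UTF-8.
raw = b"caf\xe9 plain ASCII tail"

# The "error out" policy: strict decoding refuses the invalid byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# The "don't touch what you don't understand" policy: the byte that
# cannot be decoded is kept as an artificial (invalid) code point.
text = raw.decode("utf-8", errors="surrogateescape")

# Edit only an innocent ASCII part of the text.
edited = text.replace("tail", "TAIL")

# Saving restores the untouched byte exactly as it was loaded.
saved = edited.encode("utf-8", errors="surrogateescape")
```

The part that was edited changes; the byte that was not understood comes
back unmodified, which is the property the eight-bit characters are
meant to guarantee.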

> What Python 3 (PEP 383) does is provide a family of coding system
> variants which use invalid Unicode surrogates to encode "raw bytes"
> for situations where the user asks you to proceed despite invalid
> octet sequences for the coding system; since Emacs's internal code is
> UTF-8, any Unicode surrogate is invalid and could be used for this
> purpose.  This would make non-Emacs apps barf errors on such Emacs
> autosaves, but they'll probably barf on the source file, too.

We currently _have_ such a scheme in place.  We just use different
Unicode-invalid code points.

> There's a typo in the expr above, should be "multibyte-string".  The
> proposed treatment of 241 is due to the fact that it is currently
> illegal in multibyte strings AIUI.

It is a perfectly valid character ñ in multibyte strings, but not
represented by its single-byte/latin-1 equivalent.

> Re widechar buffers: the codes for Latin-1 characters in UTF-16 and
> UTF-32 are just zero-padded extensions of the unibyte codes.

I think you may be muddling characters and their byte sequence
representations.  At least I can't read much sense into this statement
otherwise.

-- 
David Kastrup