From: "Stephen J. Turnbull"
Newsgroups: gmane.emacs.devel
Subject: Re: Inadequate documentation of silly characters on screen.
Date: Sat, 21 Nov 2009 15:42:23 +0900
Message-ID: <877htk2xbk.fsf@uwakimon.sk.tsukuba.ac.jp>
To: Stefan Monnier
Cc: Miles Bader, Alan Mackenzie, emacs-devel@gnu.org, Jason Rumney

Stefan Monnier writes:

 > I don't know what you mean.  The eight-bit "chars" were introduced to
 > make sure that decoding+reencoding will always return the exact same
 > byte-sequence, no matter what coding-system was used (i.e. even if the
 > byte-sequence is invalid for that coding-system).  Dunno how XEmacs
 > handles it.

Honestly, it currently doesn't, or doesn't very well, despite some
work by Aidan.  However, I think a well-behaved platform should by
default signal an error (something derived from invalid-state, in
XEmacs's error hierarchy) in such a case; normally this means
corruption in the file.  There are special cases like utf8latex whose
error messages give you a certain number of octets without respecting
character boundaries; I agree there is a need to handle this case.
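
The truncation case can be sketched in Python, used here only as a convenient stand-in (none of this is Emacs or XEmacs API, and the sample string is an assumption). Cutting UTF-8 text after a fixed number of octets can split a character, and a strict decoder then signals an error instead of guessing:

```python
# Sketch only: Python stands in for the editor; the sample text is an assumption.
text = "Espa" + chr(0xF1) + "ol"       # "Español"
encoded = text.encode("utf-8")          # the n-tilde becomes two octets: C3 B1

# Cut after a fixed number of octets, ignoring character boundaries.
truncated = encoded[:5]                 # ends with the lone lead octet C3

try:
    truncated.decode("utf-8")           # errors="strict" is the default
    strict_failed = False
    error_span = None
except UnicodeDecodeError as err:
    strict_failed = True
    error_span = (err.start, err.end)   # offsets of the offending octets
    print("strict decode failed:", err.reason, "at octets", error_span)
```

Whether such input should be a hard error or be preserved somehow is exactly the policy question under discussion; the sketch only shows what the strict default looks like.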

What Python 3 (PEP 383) does is provide a family of coding system
variants which use invalid Unicode surrogates to encode "raw bytes"
for situations where the user asks you to proceed despite invalid
octet sequences for the coding system; since Emacs's internal code is
UTF-8, any Unicode surrogate is invalid and could be used for this
purpose.  This would make non-Emacs apps barf errors on such Emacs
autosaves, but they'll probably barf on the source file, too.

 > > And it should be either an error to (aset string pos 241) (sorry
 > > Alan!) or 241 should be implicitly interpreted as Latin-1 (ie, ?ñ).  I
 > > favor the former, because what Alan is doing screws Spanish-speaking
 > > users AFAICS.  OTOH, the latter extends naturally if you have plans to
 > > add support for fixed-width Unicode buffers (UTF-16 and UTF-32).
 >
 > I understand this even less.

There's a typo in the expr above, should be "multibyte-string".

The proposed treatment of 241 is due to the fact that it is currently
illegal in multibyte strings AIUI.

Re the bit about Spanish-speakers: AIUI, Alan is translating multiline
strings to oneline strings by using an unusual graphic character.  But
it's only unusual in non-Spanish cases; Spanish-speakers may very well
want to include comments like "¡I wanna write this comment in
Español!" which would presumably get unfolded to "¡I wanna write this
comment in Espa\nol!"  Not very nice.

Re widechar buffers: the codes for Latin-1 characters in UTF-16 and
UTF-32 are just zero-padded extensions of the unibyte codes.  I'm
pretty sure it's this kind of thing that Ben had in mind when he
originally designed the XEmacs version of the Mule internal encoding
to make (= (char-int ?ñ) 241) true in all versions of XEmacs.

 > I think XEmacs's fundamental tradeoffs are subtly different but
 > lead to very far-reaching consequences,

Indeed, but I'm not talking about XEmacs, except for comparison of
techniques.
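
For reference, the PEP 383 mechanism and the Latin-1 zero-padding point can both be checked in Python 3. This is a minimal sketch, assuming the standard "surrogateescape" error handler and a made-up sample byte string; it illustrates the technique only and says nothing about how Emacs or XEmacs implement or would implement it:

```python
# Sketch only: Python's "surrogateescape" error handler (PEP 383).
raw = b"Espa\xf1ol"  # Latin-1 octets; 0xF1 followed by "o" is not valid UTF-8

# Decoding maps each undecodable octet 0xNN to the lone surrogate U+DCNN.
decoded = raw.decode("utf-8", errors="surrogateescape")
escaped = decoded[4]

# Re-encoding with the same handler restores the original octets exactly.
roundtrip = decoded.encode("utf-8", errors="surrogateescape")

# A strict encoder refuses the lone surrogate, which is the "other apps
# will barf" behaviour.
try:
    decoded.encode("utf-8")
    strict_rejects = False
except UnicodeEncodeError:
    strict_rejects = True

# Latin-1 code points are the zero-padded unibyte codes in UTF-16/UTF-32.
n_tilde = chr(241)
widths = {
    "latin-1": n_tilde.encode("latin-1"),
    "utf-16-be": n_tilde.encode("utf-16-be"),
    "utf-32-be": n_tilde.encode("utf-32-be"),
}

print(hex(ord(escaped)), roundtrip == raw, strict_rejects)
print(widths)
```

The big-endian codecs are used so that no byte-order mark is prepended and the padding is visible directly.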