From: Alan Mackenzie
Newsgroups: gmane.emacs.devel
Subject: Re: Fwd: Re: Inadequate documentation of silly characters on screen.
Date: Thu, 19 Nov 2009 21:25:50 +0000
Message-ID: <20091119212550.GG1314@muc.de>
In-Reply-To: <47325.130.55.118.19.1258658705.squirrel@webmail.lanl.gov>
To: Davis Herring
Cc: David Kastrup, emacs-devel@gnu.org

Hi, Davis, always good to hear from you!

On Thu, Nov 19, 2009 at 11:25:05AM -0800, Davis Herring wrote:
> [I end up having to say the same thing several times here; I thought it
> preferable to omitting any of Alan's questions or any aspect of the
> problem. It's not meant to be a rant.]

> > No, you (all of you) are missing the point. That point is that if an
> > Emacs Lisp hacker writes "?ñ", it should work, regardless of
> > what "codepoint" it has, what "bytes" represent it, whether those
> > "bytes" are coded with a different codepoint, or what have you. All of
> > that stuff is uninteresting. If it gets interesting, like now, it is
> > because it is buggy.

> When you wrote ?ñ, it did work -- that character has the Unicode (and
> Emacs 23) code point 241, so that two-character token is entirely
> equivalent to the token "241" in Emacs source. (This is independent of
> the encoding of the source file: the same two characters might be
> represented by many different octet sequences in the source file, but you
> always get 241 as the value (which is a code point and is distinct from
> octet sequences anyway).)

OK - so what's happening is that ?ñ is unambiguously 241. But Emacs
cannot say whether that is unibyte 241 or multibyte 241, which it encodes
as 4194289. Despite not knowing, Emacs is determined never to confuse a
4194289 type of 241 with a 241 type of 241. So, despite the fact that the
character 4194289 probably originated as a unibyte ?ñ, it prints it uglily
on the screen as "\361".

> But you didn't insert that object! You forced it into a (perhaps
> surprisingly: unibyte) string, which interpreted its argument (the integer
> 241) as a raw byte value, because that's what unibyte strings contain.
> When you then inserted the string, Emacs transformed it into a (somewhat
> artificial) character whose meaning is "this was really the byte 241,
> which, since it corresponds to no UTF-8 character, must merely be
> reproduced literally on disk" and whose Emacs code point is 4194289.
> (That integer looks like it could be derived from 241 by sign-extension
> for the convenience of Emacs hackers; the connection is unimportant to the
> user.)

Why couldn't Emacs have simply displayed the character as "ñ"? Why does
it have to enforce its internal dirty linen on an unsuspecting hacker?

> > OK. Surely displaying it as "\361" is a bug? Should it not display
> > as "\17777761". If it did, it would have saved half of my ranting.

> No: characters are displayed according to their meaning, not their
> internal code point. As it happens, this character's whole meaning is
> "the byte #o361", so that's what's displayed.

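[Editorial aside: the integers quoted above (241, 4194289, "\361", "\17777761") are related by plain arithmetic. The sketch below only checks that arithmetic. It is written in Python for convenience and does not call Emacs. The names `CODE_POINT`, `RAW_BYTE_CHAR`, `CHAR_BITS` and `sign_extend_byte` are made up for illustration, and the 22-bit width is an assumption chosen because it makes the "sign-extension" reading come out, not a statement about Emacs internals.]

```python
# Arithmetic behind the numbers quoted in this thread; nothing here calls Emacs.
CODE_POINT = 241          # value of ?ñ as stated in the thread
RAW_BYTE_CHAR = 4194289   # code of the "raw byte 241" character quoted in the thread
CHAR_BITS = 22            # ASSUMPTION: width under which sign-extension gives 4194289


def sign_extend_byte(byte: int, bits: int = CHAR_BITS) -> int:
    """Read `byte` as a signed 8-bit value, then keep the low `bits` bits."""
    signed = byte - 256 if byte >= 128 else byte
    return signed & ((1 << bits) - 1)


print(oct(CODE_POINT))      # 0o361, the "\361" shown on screen
print(oct(RAW_BYTE_CHAR))   # 0o17777761, the form asked about in the quote
print(hex(RAW_BYTE_CHAR))   # 0x3ffff1
print(sign_extend_byte(CODE_POINT) == RAW_BYTE_CHAR)  # True
```

[So "\361" is simply 241 in octal, and "\17777761" is 4194289 in octal. Which of the two Emacs shows is the display question being argued here.]
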
That meaning is an artificial one imposed by Emacs itself. Is there any
pressing reason to distinguish 4194289 from 241 when displaying them as
characters on a screen?

> > So, how did the character "ñ" get turned into the illegal byte #xf1?
> > Is that the bug?

> By its use in `aset' in a unibyte context (determined entirely by the
> target string).

> >> You assume that ?ñ is a character.

> > I do indeed. It is self evident.

> Its characterness is determined by context, because (as you know) Emacs
> has no distinct character type. So, in the isolation of English prose, we
> have no way of telling whether ?ñ "is" a character or an integer, any more
> than we can guess about 241. (We can guess about the writer's desires,
> but not about the real effects.)

> > Now, would you too please just agree that when I execute the three
> > forms above, and "ñ" should appear?

> That's Stefan's point: should common string literals generate multibyte
> strings (so as to change the meaning, not of the string, but of `aset',
> to what you want)?

Lisp is a high level language. It should do the Right Thing in its
representation of low level concepts, and shouldn't bug its users with
these things.

The situation is like having a text document with some characters in
ISO-8859-1 and some in UTF-8. Chaos. I stick with one of these character
sets for my personal stuff.

> Maybe: one could also address the issue by disallowing `aset' on
> unibyte strings (or strings entirely) and introducing `aset-unibyte'
> (and perhaps `aset-multibyte') so that the argument interpretation (and
> the O(n) nature of the latter) would be made clear to the programmer.

No. The problem should be solved by deciding on one single character set
visible to lisp hackers, and sticking to it rigidly. At least, that's my
humble opinion as one of the Emacs hackers least well informed on the
matter. ;-(

> Maybe the doc-string for `aset' should just bear a really loud warning.

Yes.
But it's not really `aset' which is the liability. It's "?".

> It bears more consideration than merely "yes" to your question, as
> reasonable as it seems.

> > What is the correct Emacs internal representation for "ñ" and "ä"? They
> > surely cannot share internal representations with other
> > (non-)characters?

> They have the unique internal representation as (mostly) Unicode code
> points (integers) 241 and 228, which happen to be identical to the
> representations of bytes of those values (which interpretation prevails in
> a unibyte context).

Sorry, what the heck is "the byte with value 241"? Does this concept have
any meaning, any utility beyond the machiavellian one of confusing me?
How would one use "the byte with value 241", and why does it need to be
kept distinct from "ñ"?

> Davis

--
Alan Mackenzie (Nuremberg, Germany).