From: Alan Mackenzie
Newsgroups: gmane.emacs.devel
Subject: Re: Fwd: Re: Inadequate documentation of silly characters on screen.
Date: Thu, 19 Nov 2009 18:08:48 +0000
Message-ID: <20091119180848.GE1314@muc.de>
References: <20091118191258.GA2676@muc.de> <20091119082040.GA1720@muc.de> <874ooq8xay.fsf@wanchan.jasonrumney.net> <20091119141852.GC1720@muc.de> <20091119155848.GB1314@muc.de> <87aayiihe9.fsf@lola.goethe.zz>
In-Reply-To: <87aayiihe9.fsf@lola.goethe.zz>
To: David Kastrup
Cc: emacs-devel@gnu.org

Hi, David!

On Thu, Nov 19, 2009 at 05:55:10PM +0100, David Kastrup wrote:
> Alan Mackenzie writes:

> > On Thu, Nov 19, 2009 at 10:30:18AM -0500, Stefan Monnier wrote:
> >> > The actual character in the string is ñ (#x3f).

> >> No: the string does not contain any characters, only bytes, because
> >> it's a unibyte string.

> > I'm thinking from the lisp viewpoint. The string is a data
> > structure which contains characters. I really don't want to have to
> > think about the difference between "chars" and "bytes" when I'm
> > hacking lisp. If I do, then the abstraction "string" is broken.

> >> So it contains the byte 241, not the character ñ.

> > That is then a bug. I wrote "(aset nl 0 ?ñ)", not "(aset nl 0 241)".

> Huh? ?ñ is the Emacs code point of ñ. Which is pretty much identical
> to the Unicode code point in Emacs 23.

No, you (all of you) are missing the point.
That point is that if an Emacs Lisp hacker writes "?ñ", it should work, regardless of what "codepoint" it has, what "bytes" represent it, whether those "bytes" are coded with a different codepoint, or what have you. All of that stuff is uninteresting. If it gets interesting, like now, it is because it is buggy.

> >> The byte 241 can be inserted in multibyte strings and buffers
> >> because it is also a char of code 4194289 (which gets displayed as
> >> \361).

OK. Surely displaying it as "\361" is a bug? Should it not display as "\17777761"? If it did, it would have saved half of my ranting.

> > Hang on a mo'! How can the byte 241 "be" a char of code 4194289?
> > This is some strange usage of the word "be" that I wasn't previously
> > aware of. ;-)

> Emacs encodes most of its things in utf-8. A Unicode code point is an
> integer. You can encode it in different encodings, resulting in
> different byte streams. Inside of a byte stream encoded in utf-8, the
> isolated byte 241 does not correspond to a Unicode character. It is not
> valid utf-8. When Emacs reads a file supposedly in utf-8, it wants to
> represent _all_ possible byte streams in order to be able to save
> unchanged data unmolested.

That's a good explanation - it's sort of like &lt; in html. Thanks.

> So it encodes the entity "illegal isolated byte 241 in an utf-8
> document" with the character code 4194289 which has a representation in
> Emacs' internal variant of utf-8, but is outside of the range of
> Unicode.

So, how did the character "ñ" get turned into the illegal byte #xf1? Is that the bug?

> > At this point, would you please just agree with me that when I do

> > (setq nl "\n")
> > (aset nl 0 ?ñ)
> > (insert nl)

> > , what should appear on the screen should be "ñ", NOT "\361"? Thanks!

> You assume that ?ñ is a character.

I do indeed. It is self evident. Now, would you too please just agree that when I execute the three forms above, "ñ" should appear? The identical argument applies to "ä".
They are characters used in writing weird European languages like Spanish and German. Emacs should not have difficulty with them. It is a standard Emacs idiom that ?x (or ?\x) is the integer representing the character x. Indeed (unlike in XEmacs), characters ARE integers. Why does this not work for, e.g., ISO-8859-1?

> But in Emacs, it is an integer, a Unicode code point in Emacs 23.

That sounds like the sort of argument one might read on gnu-misc-discuss. ;-) Sorry. Are you saying that Emacs is converting "?ñ" and "?ä" into the wrong integers?

> As long as there is something like a unibyte string, there is no way
> to distinguish the character 241 and the byte 241 except when Emacs is
> told explicitly.

What is the correct Emacs internal representation for "ñ" and "ä"? They surely cannot share internal representations with other (non-)characters?

> Because Emacs has no separate "character" data type.

For which I am thankful.

> --
> David Kastrup

--
Alan Mackenzie (Nuremberg, Germany).
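
For anyone following the numbers in this thread, here is a rough sketch of the arithmetic, in Python rather than Emacs Lisp so it can be run anywhere. It is only an illustration, not Emacs's actual code: the offset is simply inferred from the two figures quoted above (4194289 and 241), and the names RAW_BYTE_OFFSET and raw_byte_to_char_code are made up for this example.

```python
# Offset inferred from the figures quoted in the thread (an assumption,
# not taken from the Emacs sources): 4194289 - 241 = 0x3FFF00.
RAW_BYTE_OFFSET = 4194289 - 241
UNICODE_MAX = 0x10FFFF  # highest Unicode code point


def raw_byte_to_char_code(byte):
    """Hypothetical helper: map a lone byte 0x80..0xFF to its 'raw byte' code."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF are treated as raw bytes here")
    return byte + RAW_BYTE_OFFSET


code = raw_byte_to_char_code(241)
octal_of_code = format(code, "o")  # octal digits of the char code
octal_of_byte = format(241, "o")   # octal digits of the byte itself

# The character ñ has code point 241, and in ISO-8859-1 it is the single
# byte 241, but in UTF-8 it is the two bytes C3 B1.
n_tilde_code_point = ord("ñ")
n_tilde_latin1 = "ñ".encode("latin-1")
n_tilde_utf8 = "ñ".encode("utf-8")

# The isolated byte 241 is not valid UTF-8 on its own.
try:
    bytes([241]).decode("utf-8")
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_valid_utf8 = False

print(code, octal_of_code, octal_of_byte, code > UNICODE_MAX)
print(n_tilde_code_point, n_tilde_latin1, n_tilde_utf8, lone_byte_is_valid_utf8)
```

So the integer 241 can mean either the character ñ or the raw byte 241, and the "\361" on screen is the octal of the byte, while "17777761" is the octal of the code 4194289.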