From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Alan Mackenzie <acm@muc.de>
Newsgroups: gmane.emacs.devel
Subject: Re: Fwd: Re: Inadequate documentation of silly characters on screen.
Date: Thu, 19 Nov 2009 18:08:48 +0000
Message-ID: <20091119180848.GE1314@muc.de>
References: <20091118191258.GA2676@muc.de>
	<jwvlji3fgzi.fsf-monnier+emacs@gnu.org>
	<20091119082040.GA1720@muc.de> <m3ws1mx1jw.fsf@hase.home>
	<874ooq8xay.fsf@wanchan.jasonrumney.net>
	<20091119141852.GC1720@muc.de>
	<jwvzl6icz60.fsf-monnier+emacs@gnu.org>
	<20091119155848.GB1314@muc.de> <87aayiihe9.fsf@lola.goethe.zz>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
X-Trace: ger.gmane.org 1258654094 21897 80.91.229.12 (19 Nov 2009 18:08:14 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Thu, 19 Nov 2009 18:08:14 +0000 (UTC)
Cc: emacs-devel@gnu.org
To: David Kastrup <dak@gnu.org>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Thu Nov 19 19:08:06 2009
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by lo.gmane.org with esmtp (Exim 4.50)
	id 1NBBQM-0000KP-3z
	for ged-emacs-devel@m.gmane.org; Thu, 19 Nov 2009 19:08:06 +0100
Original-Received: from localhost ([127.0.0.1]:45035 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1NBBQL-0004Nv-Bu
	for ged-emacs-devel@m.gmane.org; Thu, 19 Nov 2009 13:08:05 -0500
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1NBBMF-0001gK-5S
	for emacs-devel@gnu.org; Thu, 19 Nov 2009 13:03:51 -0500
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1NBBMA-0001dg-M7
	for emacs-devel@gnu.org; Thu, 19 Nov 2009 13:03:50 -0500
Original-Received: from [199.232.76.173] (port=35568 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1NBBM9-0001dI-UU
	for emacs-devel@gnu.org; Thu, 19 Nov 2009 13:03:46 -0500
Original-Received: from colin.muc.de ([193.149.48.1]:1752 helo=mail.muc.de)
	by monty-python.gnu.org with esmtps
	(TLS-1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.60)
	(envelope-from <acm@muc.de>) id 1NBBM9-00022u-1z
	for emacs-devel@gnu.org; Thu, 19 Nov 2009 13:03:45 -0500
Original-Received: (qmail 56257 invoked by uid 3782); 19 Nov 2009 18:03:43 -0000
Original-Received: from acm.muc.de (pD9E51409.dip.t-dialin.net [217.229.20.9]) by
	colin2.muc.de (tmda-ofmipd) with ESMTP;
	Thu, 19 Nov 2009 19:03:42 +0100
Original-Received: (qmail 3471 invoked by uid 1000); 19 Nov 2009 18:08:48 -0000
Content-Disposition: inline
In-Reply-To: <87aayiihe9.fsf@lola.goethe.zz>
User-Agent: Mutt/1.5.9i
X-Delivery-Agent: TMDA/1.1.5 (Fettercairn)
X-Primary-Address: acm@muc.de
X-detected-operating-system: by monty-python.gnu.org: FreeBSD 4.6-4.9
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:117275
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/117275>

Hi, David!

On Thu, Nov 19, 2009 at 05:55:10PM +0100, David Kastrup wrote:
> Alan Mackenzie <acm@muc.de> writes:

> > On Thu, Nov 19, 2009 at 10:30:18AM -0500, Stefan Monnier wrote:
> >> > The actual character in the string is ñ (#x3f).

> >> No: the string does not contain any characters, only bytes, because
> >> it's a unibyte string.

> > I'm thinking from the lisp viewpoint.  The string is a data
> > structure which contains characters.  I really don't want to have to
> > think about the difference between "chars" and "bytes" when I'm
> > hacking lisp.  If I do, then the abstraction "string" is broken.

> >> So it contains the byte 241, not the character ñ.

> > That is then a bug.  I wrote "(aset nl 0 ?ñ)", not "(aset nl 0 241)".

> Huh?  ?ñ is the Emacs code point of ñ.  Which is pretty much identical
> to the Unicode code point in Emacs 23.

No, you (all of you) are missing the point.  That point is that if an
Emacs Lisp hacker writes "?ñ", it should work, regardless of
what "codepoint" it has, what "bytes" represent it, whether those
"bytes" are coded with a different codepoint, or what have you.  All of
that stuff is uninteresting.  If it gets interesting, like now, it is
because it is buggy.

> >> The byte 241 can be inserted in multibyte strings and buffers
> >> because it is also a char of code 4194289 (which gets displayed as
> >> \361).

OK.  Surely displaying it as "\361" is a bug?  Should it not display as
"\17777761"?  If it did, it would have saved half of my ranting.

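A quick check of that octal arithmetic, as a sketch only: the two numbers
are the ones quoted above, and the forms use nothing beyond `format'.

```elisp
;; 4194289 is the char code Stefan quotes; 241 is the raw byte.
(format "%o" 4194289)             ; => "17777761", the full code in octal
(format "%o" 241)                 ; => "361", the byte alone in octal
(format "#x%X" (- 4194289 241))   ; => "#x3FFF00", the offset between them
```
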
> > Hang on a mo'!  How can the byte 241 "be" a char of code 4194289?
> > This is some strange usage of the word "be" that I wasn't previously
> > aware of.  ;-)

> Emacs encodes most of its things in utf-8.  A Unicode code point is an
> integer.  You can encode it in different encodings, resulting in
> different byte streams.  Inside of a byte stream encoded in utf-8, the
> isolated byte 241 does not correspond to a Unicode character.  It is not
> valid utf-8.  When Emacs reads a file supposedly in utf-8, it wants to
> represent _all_ possible byte streams in order to be able to save
> unchanged data unmolested.

That's a good explanation - it's sort of like &lt; in html.  Thanks.

> So it encodes the entity "illegal isolated byte 241 in an utf-8
> document" with the character code 4194289 which has a representation in
> Emacs' internal variant of utf-8, but is outside of the range of
> Unicode.

So, how did the character "ñ" get turned into the illegal byte #xf1?  Is
that the bug?

> > At this point, would you please just agree with me that when I do

> >    (setq nl "\n")
> >    (aset nl 0 ?ñ)
> >    (insert nl)

> > , what should appear on the screen should be "ñ", NOT "\361"?  Thanks!

> You assume that ?ñ is a character.

I do indeed.  It is self evident.

Now, would you too please just agree that when I execute the three forms
above, "ñ" should appear?

The identical argument applies to "ä".  They are characters used in
writing weird European languages like Spanish and German.  Emacs should
not have difficulty with them.  It is a standard Emacs idiom that ?x (or
?\x) is the integer representing the character x.  Indeed (unlike in
XEmacs), characters ARE integers.  Why does this not work for, e.g.,
ISO-8859-1?
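
To make the idiom concrete, a small sketch, assuming Emacs 23 (where, as
David says above, character codes pretty much follow Unicode):

```elisp
;; Assumes Emacs 23, where ?ñ reads as the Unicode code point U+00F1.
(integerp ?x)    ; => t, a character literal is just an integer
(= ?x 120)       ; => t
(= ?ñ #xf1)      ; => t, i.e. 241
```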

> But in Emacs, it is an integer, a Unicode code point in Emacs 23.

That sounds like the sort of argument one might read on
gnu-misc-discuss.  ;-)  Sorry.  Are you saying that Emacs is converting
"?ñ" and "?ä" into the wrong integers? 

> As long as there is something like a unibyte string, there is no way
> to distinguish the character 241 and the byte 241 except when Emacs is
> told explicitly.

What is the correct Emacs internal representation for "ñ" and "ä"?  They
surely cannot share internal representations with other
(non-)characters?

> Because Emacs has no separate "character" data type.

For which I am thankful.

> -- 
> David Kastrup

-- 
Alan Mackenzie (Nuremberg, Germany).