From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: tomas@tuxteam.de (Tomas Zerolo)
Newsgroups: gmane.emacs.devel
Subject: Re: Problem with national characters in XHTML
Date: Sat, 1 Oct 2005 06:29:16 +0200
Message-ID: <20051001042916.GA29675@www.trapp.net>
References: <14e4cba14e7621.14e762114e4cba@net.lu.se>
	<E1EKZop-0000dG-00@etlken> <433AA30F.8050203@student.lu.se>
	<433AEB2D.7070906@student.lu.se>
	<20050929084322.GA16219@www.trapp.net>
	<luachwhub6.fsf@ohana.local> <433BF3FF.1070602@student.lu.se>
	<luzmpu42yv.fsf@ohana.lan> <433DC407.1070208@student.lu.se>
NNTP-Posting-Host: main.gmane.org
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============0682976059=="
X-Trace: sea.gmane.org 1128141321 10272 80.91.229.2 (1 Oct 2005 04:35:21 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Sat, 1 Oct 2005 04:35:21 +0000 (UTC)
Cc: Piet van Oostrum <piet@cs.uu.nl>, emacs-devel@gnu.org
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Sat Oct 01 06:35:18 2005
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Original-Received: from lists.gnu.org ([199.232.76.165])
	by ciao.gmane.org with esmtp (Exim 4.43)
	id 1ELZ50-000319-22
	for ged-emacs-devel@m.gmane.org; Sat, 01 Oct 2005 06:34:34 +0200
Original-Received: from localhost ([127.0.0.1] helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1ELZ4z-0004SK-Ca
	for ged-emacs-devel@m.gmane.org; Sat, 01 Oct 2005 00:34:33 -0400
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1ELZ4F-0004Pd-3k
	for emacs-devel@gnu.org; Sat, 01 Oct 2005 00:33:47 -0400
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1ELZ4B-0004Nn-Pq
	for emacs-devel@gnu.org; Sat, 01 Oct 2005 00:33:44 -0400
Original-Received: from [199.232.76.173] (helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1ELZ4B-0004Nd-Mx
	for emacs-devel@gnu.org; Sat, 01 Oct 2005 00:33:43 -0400
Original-Received: from [217.22.192.104] (helo=www.elogos.de)
	by monty-python.gnu.org with esmtp (Exim 4.34) id 1ELZ2R-0001h3-75
	for emacs-devel@gnu.org; Sat, 01 Oct 2005 00:31:55 -0400
Original-Received: from www.elogos.de (localhost [127.0.0.1])
	by www.elogos.de (Postfix) with ESMTP id 3AC89DB0CE;
	Sat,  1 Oct 2005 06:29:16 +0200 (CEST)
Original-Received: by www.elogos.de (Postfix, from userid 4000)
	id 26CD3DB17F; Sat,  1 Oct 2005 06:29:16 +0200 (CEST)
Original-To: Lennart Borgman <lennart.borgman.073@student.lu.se>
In-Reply-To: <433DC407.1070208@student.lu.se>
User-Agent: Mutt/1.5.6+20040907i
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:43414
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/43414>


--===============0682976059==
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="8t9RHnE3ZwKMSgU+"
Content-Disposition: inline


--8t9RHnE3ZwKMSgU+
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sat, Oct 01, 2005 at 01:02:31AM +0200, Lennart Borgman wrote:
> Piet van Oostrum wrote:
[...]
> >That is just the internal representation of the character in Emacs. It's
> >not important. What matters is what Emacs writes to your file. When you
> >write out utf-8 (for example by giving the command
[...]
> So you mean that at a - what should I call it? - "text semantic level"
> the utf-8 char and the latin-1 char has the same meaning?

Yes. You put that nicely. The *character* (a dieresis) stays the same.
The *representation* (loosely referred to as `encoding') changes.

I said loosely, because with more complex things such as utf-8 there are
actually two layers: the `character set', which maps each character to an
integer (aka `code point'; here the character set would be UNICODE or
ISO-10646, which nowadays are equivalent), and the representation in a
file, which may be utf-8 (most common), utf-16 or whatnot.
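
To make the two layers concrete, here is a minimal sketch in Python (the
language, the variable names and the choice of an a with dieresis as the
sample character are just my assumptions for illustration). It shows one
code point and three different byte representations of it:

```python
# One character: LATIN SMALL LETTER A WITH DIAERESIS, written as an
# escape so that this message stays pure ASCII.
ch = "\u00e4"

# Layer 1: the character set maps the character to an integer.
code_point = ord(ch)                 # 0xE4 in Unicode / ISO-10646

# Layer 2: an encoding maps that integer to bytes in a file.
as_latin1 = ch.encode("latin-1")     # b'\xe4'
as_utf8 = ch.encode("utf-8")         # b'\xc3\xa4'
as_utf16 = ch.encode("utf-16-be")    # b'\x00\xe4'

# Decoding each byte string with its own encoding gives back the
# very same character.
same = (as_latin1.decode("latin-1")
        == as_utf8.decode("utf-8")
        == as_utf16.decode("utf-16-be")
        == ch)
print(hex(code_point), as_latin1, as_utf8, as_utf16, same)
```

The bytes differ, the character does not.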

Now the advantage of utf-8: it is a variable-width encoding, and uses up
just one byte for each ASCII character (with the same value as in ASCII).
So you can interpret an ASCII file ``as-is'' as a utf-8 file.
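
A small Python sketch of that compatibility (the sample string is made
up, and Python is only my choice for illustration):

```python
text = "plain ASCII text"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For pure ASCII input both encodings produce identical bytes,
# one byte per character.
identical = ascii_bytes == utf8_bytes
one_byte_each = len(utf8_bytes) == len(text)

# So bytes written as ASCII can be read back as utf-8 unchanged.
decoded = ascii_bytes.decode("utf-8")
print(identical, one_byte_each, decoded == text)
```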

For higher characters (the ones, for example, with codes >127 in
iso-8859-1 (aka Latin1)), you need more than one byte in utf-8: two for
the Latin1 ones, and at most 4 for any character (the original design
allowed up to 6, but RFC 3629 limits it to 4).
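
If you want to see the widths for yourself, a rough Python sketch along
these lines should do (the four sample characters are my own pick, not
anything from this thread):

```python
# (description, character) pairs, written as escapes to stay ASCII.
samples = [
    ("capital A", "\u0041"),
    ("a with dieresis", "\u00e4"),
    ("euro sign", "\u20ac"),
    ("musical G clef", "\U0001d11e"),
]

# Number of bytes each character occupies once encoded as utf-8.
lengths = {name: len(ch.encode("utf-8")) for name, ch in samples}

for name, n in lengths.items():
    print(name, n)
```

These four samples come out at 1, 2, 3 and 4 bytes respectively.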

The disadvantage is: it is a variable-width encoding, so you have to
process a text sequentially, byte-for-byte to get the character
boundaries right (it's designed to re-synchronize gracefully, though).
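
The re-synchronization works because continuation bytes always have the
bit pattern 10xxxxxx, which no first byte of a character ever has. A
minimal Python sketch, assuming well-formed input (the helper name
next_boundary and the sample text are hypothetical):

```python
def next_boundary(data, pos):
    """Return the first offset >= pos where a utf-8 character starts.

    Continuation bytes match 10xxxxxx, so skip them until a byte
    that does not match (or the end of the data) is reached.
    """
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

# "na", i with dieresis, "ve ", euro sign: 10 bytes for 7 characters.
data = "na\u00efve \u20ac".encode("utf-8")

# Offset 3 is in the middle of the two-byte i with dieresis.
resynced = next_boundary(data, 3)

# Offsets of all character starts, found by scanning sequentially.
starts = [i for i in range(len(data)) if (data[i] & 0xC0) != 0x80]
print(resynced, starts)
```

So landing in the middle of a character costs you at most a few bytes
before you are back on a boundary.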

If you want the whole story (on UNICODE, ISO10646, UTF8), see here:

  <http://www.cl.cam.ac.uk/~mgk25/unicode.html>

(highly recommended). From the perspective of a web slave, see:

  <http://www.w3.org/TR/REC-html40/charset.html>

HTH
-- tomas

--8t9RHnE3ZwKMSgU+
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: Digital signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)

iD8DBQFDPhCcBcgs9XrR2kYRAoomAJ9I91PNcCWkD6KTY28IjjW8JRt9LgCfWcJl
gNqu4LO6TBCifr627PMEIXc=
=HFRq
-----END PGP SIGNATURE-----

--8t9RHnE3ZwKMSgU+--


--===============0682976059==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

--===============0682976059==--