From: tomas@tuxteam.de (Tomas Zerolo)
Newsgroups: gmane.emacs.devel
Subject: Re: Problem with national characters in XHTML
Date: Sat, 1 Oct 2005 06:29:16 +0200
Message-ID: <20051001042916.GA29675@www.trapp.net>
In-Reply-To: <433DC407.1070208@student.lu.se>
To: Lennart Borgman
Cc: Piet van Oostrum, emacs-devel@gnu.org

On Sat, Oct 01, 2005 at 01:02:31AM +0200, Lennart Borgman wrote:
> Piet van Oostrum wrote:
[...]
> >That is just the internal representation of the character in Emacs. It's
> >not important. What matters is what Emacs writes to your file. When you
> >write out utf-8 (for example by giving the command
[...]
> So you mean that at a - what should I call it? - "text semantic level"
> the utf-8 char and the latin-1 char has the same meaning?

Yes. You put that nicely. The *character* (a dieresis) stays the same.
The *representation* (loosely referred to as `encoding') changes.
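
To make that concrete, here is a minimal Python sketch. The sample
character (a-with-dieresis, U+00E4) and the variable names are my own
choice for illustration, not taken from the thread. It shows one
character, one code point, and two different byte representations:

```python
import unicodedata

# Illustrative sample character (an assumption, not from the thread):
# LATIN SMALL LETTER A WITH DIAERESIS.
ch = "\u00e4"

# The character and its code point are the same whatever the encoding.
print(unicodedata.name(ch))
print(hex(ord(ch)))

# The representation on disk depends on the encoding chosen.
latin1_bytes = ch.encode("latin-1")   # one byte
utf8_bytes = ch.encode("utf-8")       # two bytes
print(latin1_bytes.hex(), utf8_bytes.hex())

# Decoding each byte sequence with its own encoding gives back the
# same character.
same_character = latin1_bytes.decode("latin-1") == utf8_bytes.decode("utf-8")
print(same_character)
```

Decoding the bytes with the wrong encoding is what produces the usual
garbage on screen: the bytes are fine, only the interpretation is off.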

I said loosely because with something more complex, such as utf-8, there
are actually two layers: the `character set', which maps each character
to an integer (aka `code point'; here that would be Unicode or ISO 10646,
which are nowadays equivalent), and the representation in a file, which
may be utf-8 (most common), utf-16 or whatnot.

Now the advantage of utf-8: it is a variable-width encoding, and uses
just one byte for each ASCII character (for ASCII it uses the same code
values). So you can interpret an ASCII file ``as-is'' as a utf-8 file.
For higher characters (for example, the ones with codes >127 in
iso-8859-1, aka Latin1), you need more than one byte in utf-8: up to 4
bytes per character (the original design allowed up to 6, but RFC 3629
limits it to 4).

The disadvantage is: it is a variable-width encoding, so you have to
process a text sequentially, byte by byte, to get the character
boundaries right (it's designed to re-synchronize gracefully, though).

If you want the whole story (on Unicode, ISO 10646, UTF-8), see here:
(very recommended). From the perspective of a web slave, see:

HTH
-- tomas
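
P.S. A small Python sketch of the variable-width and re-synchronization
points, in case it helps. The helper names (utf8_lengths,
is_continuation, next_boundary) and the sample string are hypothetical,
made up for this illustration. The only fact it relies on is that every
utf-8 continuation byte has the bit pattern 10xxxxxx, so a reader
dropped into the middle of a character can skip forward to the next
boundary:

```python
def utf8_lengths(text):
    """Return the number of utf-8 bytes used by each character."""
    return [len(c.encode("utf-8")) for c in text]


def is_continuation(byte):
    """True if the byte is a utf-8 continuation byte (10xxxxxx)."""
    return (byte & 0b11000000) == 0b10000000


def next_boundary(data, pos):
    """Skip continuation bytes from pos to the next character start."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


# Sample: an ASCII letter, a Latin-1 letter, the euro sign, and a
# character outside the Basic Multilingual Plane.
sample = "a\u00e4\u20ac\U0001F600"
lengths = utf8_lengths(sample)
print(lengths)

# ASCII text is byte-for-byte identical in ASCII and utf-8.
ascii_compatible = "plain text".encode("ascii") == "plain text".encode("utf-8")
print(ascii_compatible)

# Land in the middle of the euro sign (byte offset 4) and resynchronize.
data = sample.encode("utf-8")
resync = next_boundary(data, 4)
print(resync, data[resync:].decode("utf-8"))
```

Note that next_boundary only finds where the next character starts; it
does not validate the data, which a real decoder would also have to do.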