unofficial mirror of emacs-devel@gnu.org 
* Problem with national characters in XHTML
@ 2005-09-28  8:29 LENNART BORGMAN
  2005-09-28 10:19 ` Jason Rumney
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: LENNART BORGMAN @ 2005-09-28  8:29 UTC (permalink / raw)


I have run into a problem with Swedish national characters in an XHTML document. The header of the document looks like this:

  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
            "http://www.w3.org/TR/REC-html40/loose.dtd">
  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">

The Swedish character ä looks like \344 in CVS Emacs (2005-09-23). It looks OK in Internet Explorer, but not in Firefox. Looking at the file with Notepad also shows the Swedish characters as expected.

I would be glad for some hints and pointers! I am using nxml-mode if that matters here.


* Re: Problem with national characters in XHTML
  2005-09-28  8:29 Problem with national characters in XHTML LENNART BORGMAN
@ 2005-09-28 10:19 ` Jason Rumney
  2005-09-28 10:22 ` David Hansen
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Jason Rumney @ 2005-09-28 10:19 UTC (permalink / raw)
  Cc: emacs-devel@gnu.org

LENNART BORGMAN wrote:

>  <?xml version="1.0" encoding="utf-8"?>
>...
>The swedish character ä looks like \344 in CVS Emacs (2005-09-23). It looks ok in Internet Explorer, but not in Firefox. Looking at the file with Notepad also shows the swedish characters as expected.
>  
>
Emacs and Firefox are doing the right thing. The byte \344 by itself is 
not a valid UTF-8 character. Replace it with ä in Emacs, and it should 
appear correctly.
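
The point about \344 can be checked outside Emacs. A minimal Python sketch (Python is used only for illustration here; the helper name `is_valid_utf8` is hypothetical and not from the thread): octal 344 is the single byte 0xE4, which a strict UTF-8 decoder rejects when it stands alone, while an ISO 8859-1 decoder reads it as ä.

```python
# Octal 344 is the single byte 0xE4.
raw = bytes([0o344])

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A lone 0xE4 announces a multi-byte sequence that never arrives.
lone_byte_ok = is_valid_utf8(raw)

# The same byte read as ISO 8859-1 is the character U+00E4.
as_latin1 = raw.decode("iso-8859-1")

# The UTF-8 form of that character is the two bytes C3 A4, which is valid.
proper_utf8_ok = is_valid_utf8("\u00e4".encode("utf-8"))

print(lone_byte_ok, as_latin1, proper_utf8_ok)
```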


* Re: Problem with national characters in XHTML
  2005-09-28  8:29 Problem with national characters in XHTML LENNART BORGMAN
  2005-09-28 10:19 ` Jason Rumney
@ 2005-09-28 10:22 ` David Hansen
  2005-09-28 10:22 ` Paul Pogonyshev
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: David Hansen @ 2005-09-28 10:22 UTC (permalink / raw)


On Wed, 28 Sep 2005 10:29:21 +0200 LENNART BORGMAN wrote:

> I have run into a problem with swedish national characters in
> an XHTML document. The header of the document is like this:
>
>   <?xml version="1.0" encoding="utf-8"?>
>   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
>             "http://www.w3.org/TR/REC-html40/loose.dtd">
>   <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
>
> The swedish character ä looks like \344 in CVS Emacs
> (2005-09-23). 

\344 is a Latin-1 encoded ä not UTF-8.

David


* Re: Problem with national characters in XHTML
  2005-09-28  8:29 Problem with national characters in XHTML LENNART BORGMAN
  2005-09-28 10:19 ` Jason Rumney
  2005-09-28 10:22 ` David Hansen
@ 2005-09-28 10:22 ` Paul Pogonyshev
  2005-09-28 10:41 ` Tomas Zerolo
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Paul Pogonyshev @ 2005-09-28 10:22 UTC (permalink / raw)


LENNART BORGMAN wrote:
> I have run into a problem with swedish national characters in an XHTML
> document. The header of the document is like this:
>
>   <?xml version="1.0" encoding="utf-8"?>
>   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
>             "http://www.w3.org/TR/REC-html40/loose.dtd">
>   <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
>
> The swedish character ä looks like \344 in CVS Emacs (2005-09-23). It looks
> ok in Internet Explorer, but not in Firefox. Looking at the file with
> Notepad also shows the swedish characters as expected.
>
> I would be glad for some hints and pointers! I am using nxml-mode if that
> matters here.

There is probably a conflict of encodings.  Note that the encoding is often
duplicated in a <meta ... /> tag:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
	  PUBLIC "-//W3C//DTD XHTML 1.1//EN"
	  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html>

<head>
  ...  
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  ...

Check that you have UTF-8 there too.  Finally, check that your non-ASCII characters
are indeed encoded in UTF-8.
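
One way to run that last check, as a rough sketch rather than a real validator: read the encoding named in the XML declaration and try to decode the file's bytes with it. The function names `declared_encoding` and `matches_declaration` are made up for this example, and the sample documents are tiny stand-ins for the real file.

```python
import re

def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Read the encoding name from an XML declaration, if there is one."""
    match = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', data)
    return match.group(1).decode("ascii") if match else default

def matches_declaration(data: bytes) -> bool:
    """True if the bytes decode cleanly in the declared encoding."""
    try:
        data.decode(declared_encoding(data))
        return True
    except (UnicodeDecodeError, LookupError):
        return False

header = b'<?xml version="1.0" encoding="utf-8"?>\n<p>'
latin1_doc = header + "\u00e4".encode("iso-8859-1") + b"</p>"  # body byte E4
utf8_doc = header + "\u00e4".encode("utf-8") + b"</p>"         # body bytes C3 A4

print(matches_declaration(latin1_doc))
print(matches_declaration(utf8_doc))
```

This only catches files labelled utf-8 that contain other bytes. The reverse case cannot be detected this way, because every byte sequence decodes without error as ISO 8859-1.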

Paul


* Re: Problem with national characters in XHTML
  2005-09-28  8:29 Problem with national characters in XHTML LENNART BORGMAN
                   ` (2 preceding siblings ...)
  2005-09-28 10:22 ` Paul Pogonyshev
@ 2005-09-28 10:41 ` Tomas Zerolo
  2005-09-28 10:44 ` Juanma Barranquero
  2005-09-28 11:09 ` Kenichi Handa
  5 siblings, 0 replies; 20+ messages in thread
From: Tomas Zerolo @ 2005-09-28 10:41 UTC (permalink / raw)
  Cc: emacs-devel@gnu.org


On Wed, Sep 28, 2005 at 10:29:21AM +0200, LENNART BORGMAN wrote:
> I have run into a problem with swedish national characters in an XHTML document. The header of the document is like this:
> 
>   <?xml version="1.0" encoding="utf-8"?>
>   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
>             "http://www.w3.org/TR/REC-html40/loose.dtd">
>   <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">

Hm. Note that the header says of itself that it's encoded in utf-8. I
don't know whether it's relevant.

> The swedish character ä looks like \344 in CVS Emacs (2005-09-23).

If Emacs honors the header above, then this won't work: Octal 344 is an
a-with-dieresis, but in iso 8859-1 encoding, not utf-8.

> It looks ok in Internet Explorer, but not in Firefox.

I'd say Firefox is right on this one ;-)

Seriously: you can force the browser to assume an encoding, so what the
browser shows depends on settings which may vary from time to time. On
Firefox, it's under View -> Character Encoding. No idea about IE (and
I'm glad not to know ;-).

>                                                       Looking at the
> file with Notepad also shows the swedish characters as expected.

Notepad uses whatever encoding its font has; I guess an 8-bit fixed
encoding.

> I would be glad for some hints and pointers! I am using nxml-mode if
> that matters here.

You may try two things: changing the utf-8 in the header to iso-8859-1
or (better) inserting your a-dieresis as a utf-8-encoded char.

Regards
-- tomás

[-- Attachment #2: Type: text/plain, Size: 142 bytes --]

_______________________________________________
Emacs-devel mailing list
Emacs-devel@gnu.org
http://lists.gnu.org/mailman/listinfo/emacs-devel


* Re: Problem with national characters in XHTML
  2005-09-28  8:29 Problem with national characters in XHTML LENNART BORGMAN
                   ` (3 preceding siblings ...)
  2005-09-28 10:41 ` Tomas Zerolo
@ 2005-09-28 10:44 ` Juanma Barranquero
  2005-09-29 11:11   ` Mathias Dahl
  2005-09-28 11:09 ` Kenichi Handa
  5 siblings, 1 reply; 20+ messages in thread
From: Juanma Barranquero @ 2005-09-28 10:44 UTC (permalink / raw)
  Cc: emacs-devel@gnu.org

On 9/28/05, LENNART BORGMAN <lennart.borgman.073@student.lu.se> wrote:

> I have run into a problem with swedish national characters in an XHTML document. The header of the document is like this:
>
>   <?xml version="1.0" encoding="utf-8"?>
>   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
>             "http://www.w3.org/TR/REC-html40/loose.dtd">
>   <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
>
> The swedish character ä looks like \344 in CVS Emacs (2005-09-23).

Hmm. An XHTML document with encoding="utf-8" should not have "swedish
national characters" in it, should it? Upon reading the file, Emacs
will set its coding system to mule-utf-8, so it's no surprise that
high-bit, non-valid utf8 byte sequences appear as \xxx...

I've created a document with your header, and put an "É" in it with
notepad. Emacs shows this char as \311. I would not consider this an
error :)

--
                    /L/e/k/t/u


* Re: Problem with national characters in XHTML
@ 2005-09-28 11:08 LENNART BORGMAN
  0 siblings, 0 replies; 20+ messages in thread
From: LENNART BORGMAN @ 2005-09-28 11:08 UTC (permalink / raw)


OK, thanks for the help to all who replied. I tried to learn a bit ;-)

Putting iso-8859-1 in the header instead of utf-8 as Tomas Zerolo suggested solved the problem.


----- Original Message -----
From: Juanma Barranquero <lekktu@gmail.com>
Date: Wednesday, September 28, 2005 12:44 pm
Subject: Re: Problem with national characters in XHTML

> On 9/28/05, LENNART BORGMAN <lennart.borgman.073@student.lu.se> wrote:
> 
> > I have run into a problem with swedish national characters in an 
> XHTML document. The header of the document is like this:
> >
> >   <?xml version="1.0" encoding="utf-8"?>
> >   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
> >             "http://www.w3.org/TR/REC-html40/loose.dtd">
> >   <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
> >
> > The swedish character ä looks like \344 in CVS Emacs (2005-09-23).
> 
> Hmm. An XHTML document with encoding="utf-8" should not have "swedish
> national characters" in it, should it? Upon reading the file, Emacs
> will set its coding system to mule-utf-8, so it's no surprise than
> high-bit, non-valid utf8 byte sequences appear as \xxx...
> 
> I've created a document with your header, and put an "É" in it with
> notepad. Emacs shows this char as \311. I would not consider this an
> error :)
> 
> --
>                    /L/e/k/t/u
>


* Re: Problem with national characters in XHTML
  2005-09-28  8:29 Problem with national characters in XHTML LENNART BORGMAN
                   ` (4 preceding siblings ...)
  2005-09-28 10:44 ` Juanma Barranquero
@ 2005-09-28 11:09 ` Kenichi Handa
  2005-09-28 14:05   ` Lennart Borgman
  5 siblings, 1 reply; 20+ messages in thread
From: Kenichi Handa @ 2005-09-28 11:09 UTC (permalink / raw)
  Cc: emacs-devel

In article <14e4cba14e7621.14e762114e4cba@net.lu.se>, LENNART BORGMAN <lennart.borgman.073@student.lu.se> writes:

> I have run into a problem with swedish national characters in an XHTML document. The header of the document is like this:
>   <?xml version="1.0" encoding="utf-8"?>
>   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
>             "http://www.w3.org/TR/REC-html40/loose.dtd">
>   <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">

> The swedish character ä looks like \344 in CVS Emacs (2005-09-23). It looks ok in Internet Explorer, but not in Firefox. Looking at the file with Notepad also shows the swedish characters as expected.

> I would be glad for some hints and pointers! I am using nxml-mode if that matters here.

Could you please send me the whole file?

---
Kenichi Handa
handa@m17n.org


* Re: Problem with national characters in XHTML
  2005-09-28 11:09 ` Kenichi Handa
@ 2005-09-28 14:05   ` Lennart Borgman
  2005-09-28 19:12     ` Lennart Borgman
  0 siblings, 1 reply; 20+ messages in thread
From: Lennart Borgman @ 2005-09-28 14:05 UTC (permalink / raw)
  Cc: emacs-devel

Kenichi Handa wrote:

> In article <14e4cba14e7621.14e762114e4cba@net.lu.se>, LENNART BORGMAN <lennart.borgman.073@student.lu.se> writes:
>
>> I have run into a problem with Swedish national characters in an XHTML document. The header of the document is like this:
>>   <?xml version="1.0" encoding="utf-8"?>
>>   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
>>             "http://www.w3.org/TR/REC-html40/loose.dtd">
>>   <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
>>
>> The Swedish character ä looks like \344 in CVS Emacs (2005-09-23). It looks ok in Internet Explorer, but not in Firefox. Looking at the file with Notepad also shows the Swedish characters as expected.
>>
>> I would be glad for some hints and pointers! I am using nxml-mode if that matters here.
>
> Could you please send me the whole file?

I have attached two test files in XHTML, one using utf-8 in the header and
the other iso-8859-1. Those files tell what is displayed in IE and
Firefox and how the Swedish character ä was entered (though I guess some
info might be missing for the experts here).

I still find this a bit confusing. What character is entered by Emacs
when I type ä on my Swedish keyboard? When I look at the character ä in
Emacs with (following-char), it returns 2276 in both test files. Is that
what I would expect in the iso-8859-1 test file? (It starts with <?xml
version="1.0" encoding="iso-8859-1"?>)


[-- Attachment #2: test-xhtml-iso-8859-1.html --]
[-- Type: text/html, Size: 1289 bytes --]

[-- Attachment #3: test-xhtml-utf-8.html --]
[-- Type: text/html, Size: 1859 bytes --]


* Re: Problem with national characters in XHTML
  2005-09-28 14:05   ` Lennart Borgman
@ 2005-09-28 19:12     ` Lennart Borgman
  2005-09-29  8:43       ` Tomas Zerolo
  0 siblings, 1 reply; 20+ messages in thread
From: Lennart Borgman @ 2005-09-28 19:12 UTC (permalink / raw)
  Cc: emacs-devel, Kenichi Handa

Lennart Borgman wrote:

> Kenichi Handa wrote:
>
>> Could you please send me the whole file?
>>  
>>
> I have attached to test files in XHTML, one user utf-8 in the header 
> and the other iso-8859-1. Those files tells what is displayed in IE 
> and Firefox and how the swedish character ä was entered (though I 
> guess some info might be missing for the experts here).
>
> I find this a bit confusing still. What character is entered by Emacs 
> when I type ä on my swedish keyboard? When I look at the character ä 
> in Emacs with (following-char) it in both test files returns 2276. Is 
> that what I would expect in the iso-8859-1 test file? (It starts with 
> <?xml version="1.0" encoding="iso-8859-1"?>)

I have placed the files I attached last time at
http://ourcomments.org/Emacs/char/ and added some more comments. I have
tried different ways to add the Swedish character ä and all of them
seem to result in a character with value 2276 being added. Even C-q 3
4 4 RET results in this, which surprises me. Should it be that way?


* Re: Problem with national characters in XHTML
  2005-09-28 19:12     ` Lennart Borgman
@ 2005-09-29  8:43       ` Tomas Zerolo
  2005-09-29 13:34         ` Piet van Oostrum
  0 siblings, 1 reply; 20+ messages in thread
From: Tomas Zerolo @ 2005-09-29  8:43 UTC (permalink / raw)
  Cc: Kenichi Handa, emacs-devel


On Wed, Sep 28, 2005 at 09:12:45PM +0200, Lennart Borgman wrote:
> Lennart Borgman wrote:

[...]

> >I find this a bit confusing still. What character is entered by Emacs 
> >when I type ä on my swedish keyboard? When I look at the character ä 
> >in Emacs with (following-char) it in both test files returns 2276. Is 
> >that what I would expect in the iso-8859-1 test file? (It starts with 
> ><?xml version="1.0" encoding="iso-8859-1"?>)

Ah. You have to distinguish between Emacs's internal representation
(that's possibly the 2276 you mention), which doesn't change (at least
unless you try hard ;) and what is in the file (how Emacs writes or
interprets what it reads). You can change the latter by changing the
coding system (look for something like `multilingual environment').

You can see what coding system is active by doing
`M-x describe-coding-system'.

HTH
-- tomás


* Re: Problem with national characters in XHTML
  2005-09-28 10:44 ` Juanma Barranquero
@ 2005-09-29 11:11   ` Mathias Dahl
  2005-09-29 13:28     ` Piet van Oostrum
  0 siblings, 1 reply; 20+ messages in thread
From: Mathias Dahl @ 2005-09-29 11:11 UTC (permalink / raw)


Juanma Barranquero <lekktu@gmail.com> writes:

> On 9/28/05, LENNART BORGMAN <lennart.borgman.073@student.lu.se> wrote:
>
>> I have run into a problem with swedish national characters in an
>> XHTML document. The header of the document is like this:
>>
>>   <?xml version="1.0" encoding="utf-8"?>
>>   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
>>             "http://www.w3.org/TR/REC-html40/loose.dtd">
>>   <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
>>
>> The swedish character ä looks like \344 in CVS Emacs (2005-09-23).
>
> Hmm. An XHTML document with encoding="utf-8" should not have
> "swedish national characters" in it, should it? Upon reading the
> file, Emacs will set its coding system to mule-utf-8, so it's no
> surprise than high-bit, non-valid utf8 byte sequences appear as
> \xxx...

I might be wrong here, but doesn't UTF-8 encode all characters in
Latin-1 (ISO 8859-1) exactly as they are *in* Latin-1 encoding?


* Re: Problem with national characters in XHTML
  2005-09-29 11:11   ` Mathias Dahl
@ 2005-09-29 13:28     ` Piet van Oostrum
  2005-09-29 13:52       ` Lennart Borgman
  0 siblings, 1 reply; 20+ messages in thread
From: Piet van Oostrum @ 2005-09-29 13:28 UTC (permalink / raw)


>>>>> Mathias Dahl <brakjoller@gmail.com> (MD) wrote:

>MD> I might be wrong here, but doesn't UTF-8 encode all characters in
>MD> Latin-1 (ISO 8859-1) exactly as they are *in* Latin-1 encoding?

No. Iso 8859-1 uses 1 byte for all characters, while UTF-8 uses two bytes
for those characters that are in iso-8859-1. What you probably mean is that
the Unicode value (code point) for each iso-8859-1 character is the same as
its encoding in iso-8859-1.
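
The distinction between code point and encoding can be checked for the whole upper half of ISO 8859-1. A short Python sketch (for illustration only, not from the thread): for every byte value from 0xA0 to 0xFF, the Unicode code point equals the ISO 8859-1 byte value, yet the UTF-8 encoding is two bytes long.

```python
# Characters in the upper half of ISO 8859-1 (0xA0 to 0xFF).
for byte_value in range(0xA0, 0x100):
    ch = bytes([byte_value]).decode("iso-8859-1")
    # The Unicode code point equals the ISO 8859-1 byte value ...
    assert ord(ch) == byte_value
    # ... but the UTF-8 encoding takes two bytes, not that single byte.
    assert len(ch.encode("utf-8")) == 2

a_umlaut = "\u00e4"
latin1_bytes = a_umlaut.encode("iso-8859-1")
utf8_bytes = a_umlaut.encode("utf-8")

# Prints: 0xe4 e4 c3a4
print(hex(ord(a_umlaut)), latin1_bytes.hex(), utf8_bytes.hex())
```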

-- 
Piet van Oostrum <piet@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: piet@vanoostrum.org


* Re: Problem with national characters in XHTML
  2005-09-29  8:43       ` Tomas Zerolo
@ 2005-09-29 13:34         ` Piet van Oostrum
  2005-09-29 14:02           ` Lennart Borgman
  0 siblings, 1 reply; 20+ messages in thread
From: Piet van Oostrum @ 2005-09-29 13:34 UTC (permalink / raw)


>>>>> tomas@tuxteam.de (Tomas Zerolo) (TZ) wrote:

>TZ> Ah. You have to distinguish between Emacs's internal representation
>TZ> (that's possibly the 2276 you mention), which doesn't change (al least
>TZ> unless you try hard ;) and what is in the file (how Emacs writes or
>TZ> interprets what it reads). You can change those things changing the
>TZ> coding system (look for something like `multilingual environment').

By default Emacs uses different internal representations for the "same"
character in different coding systems. So an iso-8859-1 "ä" is a different
thing from a utf-8 "ä". This difference will disappear when Emacs switches
to Unicode internally. For the time being the OP could use Unicode
unification, if his Emacs version is recent enough. I have used this for
some years now without any problems. Maybe it solves the original problem.

(require 'ucs-tables)
(unify-8859-on-encoding-mode 1)
(unify-8859-on-decoding-mode 1)

-- 
Piet van Oostrum <piet@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: piet@vanoostrum.org


* Re: Problem with national characters in XHTML
  2005-09-29 13:28     ` Piet van Oostrum
@ 2005-09-29 13:52       ` Lennart Borgman
  0 siblings, 0 replies; 20+ messages in thread
From: Lennart Borgman @ 2005-09-29 13:52 UTC (permalink / raw)
  Cc: emacs-devel

Piet van Oostrum wrote:

> >>>>> Mathias Dahl <brakjoller@gmail.com> (MD) wrote:
>
> >MD> I might be wrong here, but doesn't UTF-8 encode all characters in
> >MD> Latin-1 (ISO 8859-1) exactly as they are *in* Latin-1 encoding?
>
> No. Iso 8859-1 uses 1 byte for all characters, while UTF-8 uses two bytes
> for those characters that are in iso-8859-1. What you probably mean is that
> the Unicode value (code point) for each iso-8859-1 character is the same as
> its encoding in iso-8859-1.

This is not easy. What you say makes it even more interesting why C-q 3 4
4 RET is stored as 2276 (or what it was) in the XHTML files. How can
that be? (For the context see my earlier mails.)


* Re: Problem with national characters in XHTML
  2005-09-29 13:34         ` Piet van Oostrum
@ 2005-09-29 14:02           ` Lennart Borgman
  2005-09-30 22:15             ` Piet van Oostrum
  0 siblings, 1 reply; 20+ messages in thread
From: Lennart Borgman @ 2005-09-29 14:02 UTC (permalink / raw)
  Cc: emacs-devel

Piet van Oostrum wrote:

> >>>>> tomas@tuxteam.de (Tomas Zerolo) (TZ) wrote:
>
> >TZ> Ah. You have to distinguish between Emacs's internal representation
> >TZ> (that's possibly the 2276 you mention), which doesn't change (at least
> >TZ> unless you try hard ;) and what is in the file (how Emacs writes or
> >TZ> interprets what it reads). You can change those things changing the
> >TZ> coding system (look for something like `multilingual environment').
>
> By default Emacs uses different internal representations for the "same"
> character in different coding systems. So a iso-8859-1 "ä" is a different
> thing than a utf-8 "ä". This difference will disappear when Emacs switches
> to Unicode internally. For the time being the OP could use Unicode
> unification, if his Emacs version is young enough. I have used this for
> some years now without any problems. Maybe it solves the original problem.
>
> (require 'ucs-tables)
> (unify-8859-on-encoding-mode 1)
> (unify-8859-on-decoding-mode 1)

The values I have in CVS emacs.exe -Q are

  (featurep 'ucs-tables)  = t
  unify-8859-on-encoding-mode = t
  unify-8859-on-decoding-mode = nil

Though I do not understand what it means right now ;-)

Evaluating (unify-8859-on-decoding-mode 1) does not change the behaviour of
C-q 3 4 4 RET. It still enters a character that (following-char) reports as

  2276 (04344, 0x8e4)

I did not notice before that there only seems to be one bit that differs
(see the second figure) - if that in some way matters.
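
The one-bit observation is plain arithmetic and can be confirmed in a few lines of Python (illustration only). The sketch checks that 2276, 04344 and 0x8e4 are the same number and that it differs from octal 344 in exactly one bit. What that extra bit means inside Emacs is not established by this check.

```python
internal = 2276   # value reported by (following-char) in the thread
latin1 = 0o344    # octal 344, the ISO 8859-1 byte for the character

# The three notations in the report are the same number.
assert internal == 0o4344 == 0x8E4
assert latin1 == 0xE4 == 228

# XOR leaves a 1 wherever the two values differ.
difference = internal ^ latin1
differing_bits = bin(difference).count("1")

# Prints: 0x800 1
print(hex(difference), differing_bits)
```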


* Re: Problem with national characters in XHTML
  2005-09-29 14:02           ` Lennart Borgman
@ 2005-09-30 22:15             ` Piet van Oostrum
  2005-09-30 23:02               ` Lennart Borgman
  0 siblings, 1 reply; 20+ messages in thread
From: Piet van Oostrum @ 2005-09-30 22:15 UTC (permalink / raw)


>>>>> Lennart Borgman <lennart.borgman.073@student.lu.se> (LB) wrote:

>LB> Evaling (unify-8859-on-decoding-mode 1) does not change the behaviour of
>LB> C-q 3 4 4 RET. It still enters a character that (following-char) reports as

>LB>   2276 (04344, 0x8e4)

That is just the internal representation of the character in Emacs. It's
not important. What matters is what Emacs writes to your file. When you
write out utf-8 (for example by giving the command
(set-buffer-file-coding-system 'utf-8)) it will write out C3 A4,
whereas if you use (set-buffer-file-coding-system 'latin-1) it will write
out E4.
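
The same effect can be reproduced outside Emacs. A small Python sketch (an analogy, not what Emacs does internally; `bytes_on_disk` is a hypothetical helper) writes the one character ä to a temporary file in each encoding and reads the raw bytes back.

```python
import os
import tempfile

text = "\u00e4"  # the character a-dieresis

def bytes_on_disk(content: str, encoding: str) -> bytes:
    """Write content to a temporary file in the given encoding, return the raw bytes."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding, newline="") as handle:
            handle.write(content)
        with open(path, "rb") as handle:
            return handle.read()
    finally:
        os.remove(path)

utf8_file = bytes_on_disk(text, "utf-8")
latin1_file = bytes_on_disk(text, "latin-1")

# Prints: c3a4 e4
print(utf8_file.hex(), latin1_file.hex())
```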
-- 
Piet van Oostrum <piet@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: piet@vanoostrum.org


* Re: Problem with national characters in XHTML
  2005-09-30 22:15             ` Piet van Oostrum
@ 2005-09-30 23:02               ` Lennart Borgman
  2005-10-01  4:29                 ` Tomas Zerolo
  2005-10-01 11:22                 ` Piet van Oostrum
  0 siblings, 2 replies; 20+ messages in thread
From: Lennart Borgman @ 2005-09-30 23:02 UTC (permalink / raw)
  Cc: emacs-devel

Piet van Oostrum wrote:

> >LB> Evaling (unify-8859-on-decoding-mode 1) does not change the behaviour of
> >LB> C-q 3 4 4 RET. It still enters a character that (following-char) reports as
>
> >LB>   2276 (04344, 0x8e4)
>
> That is just the internal representation of the character in Emacs. It's
> not important. What matters is what Emacs writes to your file. When you
> write out utf-8 (for example by giving the command
> (set-buffer-file-coding-system 'utf-8)) it will write out C3 A4,
> whereas if you use (set-buffer-file-coding-system 'latin-1) it will write
> out E4.

So you mean that at a - what should I call it? - "text semantic level"
the utf-8 char and the latin-1 char have the same meaning?


* Re: Problem with national characters in XHTML
  2005-09-30 23:02               ` Lennart Borgman
@ 2005-10-01  4:29                 ` Tomas Zerolo
  2005-10-01 11:22                 ` Piet van Oostrum
  1 sibling, 0 replies; 20+ messages in thread
From: Tomas Zerolo @ 2005-10-01  4:29 UTC (permalink / raw)
  Cc: Piet van Oostrum, emacs-devel


On Sat, Oct 01, 2005 at 01:02:31AM +0200, Lennart Borgman wrote:
> Piet van Oostrum wrote:
[...]
> >That is just the internal representation of the character in Emacs. It's
> >not important. What matters is what Emacs writes to your file. When you
> >write out utf-8 (for example by giving the command
[...]
> So you mean that at a - what should I call it? - "text semantic level" 
> the utf-8 char and the latin-1 char has the same meaning?

Yes. You put that nicely. The *character* (a dieresis) stays the same.
The *representation* (loosely referred to as `encoding') changes.

I said loosely, because with more complex things such as utf-8 there are
actually two layers: the `character set', mapping each character to an
integer (aka `code point', which in this case would be UNICODE or
ISO-10646, which nowadays are equivalent), and the representation in a
file, which may be utf-8 (most common), utf-16 or whatnot.

Now the advantage of utf-8: it is a variable-width encoding, and uses up
just one byte for one ASCII character (on ASCII it uses the same code
points). So you can interpret an ASCII file ``as-is'' as an utf-8 file.

For higher characters (the ones, for example with codes >127 in
iso-8859-1 (aka Latin1)), you need more than one byte in utf-8. AFAIK,
up to 6 bytes, but don't take that too seriously.

The disadvantage is: it is a variable-width encoding, so you have to
process a text sequentially, byte-for-byte to get the character
boundaries right (it's designed to re-synchronize gracefully, though).
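
The variable width and the resynchronization property can be seen in a short Python sketch (illustration only; the helper names are made up). Continuation bytes always have the bit pattern 10xxxxxx, so a reader that lands in the middle of a character can skip forward to the next boundary. On the six-byte figure mentioned earlier: the original UTF-8 design allowed sequences of up to six bytes, but RFC 3629 limits UTF-8 to four bytes per character.

```python
def is_continuation(byte: int) -> bool:
    """UTF-8 continuation bytes have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80

def char_starts(data: bytes) -> list:
    """Byte offsets at which a character begins in UTF-8 data."""
    return [i for i, byte in enumerate(data) if not is_continuation(byte)]

# 'a' takes 1 byte, a-dieresis takes 2 bytes, the euro sign takes 3 bytes.
sample = "a\u00e4\u20ac".encode("utf-8")
starts = char_starts(sample)

# Resynchronize from an arbitrary offset: skip past continuation bytes.
offset = 2  # lands on the second byte of a-dieresis
while offset < len(sample) and is_continuation(sample[offset]):
    offset += 1

# Prints: 6 [0, 1, 3] 3
print(len(sample), starts, offset)
```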

If you want the whole story (on UNICODE, ISO10646, UTF8), see here:

  <http://www.cl.cam.ac.uk/~mgk25/unicode.html>

(very recommended). From the perspective of a web slave, see:

  <http://www.w3.org/TR/REC-html40/charset.html>

HTH
-- tomas


* Re: Problem with national characters in XHTML
  2005-09-30 23:02               ` Lennart Borgman
  2005-10-01  4:29                 ` Tomas Zerolo
@ 2005-10-01 11:22                 ` Piet van Oostrum
  1 sibling, 0 replies; 20+ messages in thread
From: Piet van Oostrum @ 2005-10-01 11:22 UTC (permalink / raw)


>>>>> Lennart Borgman <lennart.borgman.073@student.lu.se> (LB) wrote:

>LB> So you mean that at a - what should I call it? - "text semantic level" the
>LB> utf-8 char and the latin-1 char has the same meaning?

With Unicode unification yes. Without (I think - input) unification an
iso-8859-1 "ä" and an iso-8859-9 "ä" and others all use different codes,
which can be quite annoying.
-- 
Piet van Oostrum <piet@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: piet@vanoostrum.org

