From: Karl Eichwalder <keichwa@gmx.net>
Cc: eliz@is.elta.co.il, emacs-devel@gnu.org, Andreas Schwab <schwab@suse.de>
Subject: Re: Reporting UTF-8 related problems?
Date: Tue, 30 Jul 2002 20:58:32 +0200
Message-ID: <sh65yxniuf.fsf@tux.gnu.franken.de>
In-Reply-To: <200207300711.QAA05993@etlken.m17n.org> (Kenichi Handa's message of "Tue, 30 Jul 2002 16:11:18 +0900 (JST)")
Kenichi Handa <handa@etl.go.jp> writes:
> &#132;Die Familie Schroffenstein&#147;
>
> I thought that the notation &#NUMBER is for transmitting
> Unicode character of code NUMBER. But, 132 and 147 are
> control codes in Unicode, not any kind of quotings.
&#NUMBERs are so-called "character references"; the SGML declaration
defines which are allowed. For HTML you must consult the html.d[e]?cl
file. The crucial section (for HTML 2) is:
    BASESET  "ISO Registration Number 100//CHARSET
              ECMA-94 Right Part of
              Latin Alphabet Nr. 1//ESC 2/13 4/1"
    DESCSET  128  32  UNUSED
             160  96  32
This basically means: &#128; to &#159; are unused. The same applies to
HTML 4 (and later to XML and XHTML, respectively):
    BASESET  "ISO Registration Number 177//CHARSET
              ISO/IEC 10646-1:1993 UCS-4 with
              implementation level 3//ESC 2/5 2/15 4/6"
    DESCSET  0    9   UNUSED
             9    2   9
             11   2   UNUSED
             13   1   13
             14   18  UNUSED
             32   95  32
             127  1   UNUSED
             128  32  UNUSED
             [...]
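As an illustration of how such a DESCSET table is read (start number, count, then either a base-set number or UNUSED), here is a minimal Python sketch. The rows are copied from the HTML 4 excerpt above; the names DESCSET_HTML4 and is_declared are hypothetical, and rows behind the "[...]" are deliberately not filled in.

```python
# Rows copied from the quoted HTML 4 DESCSET excerpt: (start, count, base).
# None stands for UNUSED. Rows elided by "[...]" in the mail are NOT included.
DESCSET_HTML4 = [
    (0, 9, None),
    (9, 2, 9),
    (11, 2, None),
    (13, 1, 13),
    (14, 18, None),
    (32, 95, 32),
    (127, 1, None),
    (128, 32, None),
]


def is_declared(number, descset=DESCSET_HTML4):
    """Return True if `number` lies in a used range, False if it lies in an
    UNUSED range, and None if the excerpt does not cover it at all."""
    for start, count, base in descset:
        if start <= number < start + count:
            return base is not None
    return None


# The two references from the page under discussion fall into 128..159.
for n in (132, 147):
    print(n, "declared" if is_declared(n) else "UNUSED")
```

With this table, both 132 and 147 land in the "128 32 UNUSED" row, which is why a validating parser rejects them.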
To make the SGML parser happy you can provide a changed declaration:
    BASESET  "ISO Registration Number 177//CHARSET
              ISO/IEC 10646-1:1993 UCS-4 with
              implementation level 3//ESC 2/5 2/15 4/6"
    DESCSET  0    9   UNUSED
             9    2   9
             11   2   UNUSED
             13   1   13
             14   18  UNUSED
             32   95  32
             127  1   UNUSED
             128  4   UNUSED
             132  1   "My rising double quote left (low)"
             133  14  UNUSED
             147  1   "My rising double quote right (high)"
             148  16  UNUSED
             [...]
Untested, and the result is invalid HTML. If they announced a
proper HTTP header, it could be okay:
Content-Type: text/html; charset=windows-1252
Andreas Schwab <schwab@suse.de> writes:
> The numbers are supposed to be ISO 8859-1 characters codes. I'd guess the
> page has been written with some broken (a.k.a. W*nd*ws) software (the use
> of *.htm makes this apparent).
Yes, they have "interesting" guidelines online...
Kenichi Handa <handa@etl.go.jp> writes:
> Ah, I see. I found that windows-125X maps 132 and 147 to
> U+201E and U+201C. So, perhaps those systems (galeon and
> lynx) parse them as U+201E and U+201C. Anyway, how to
> encode them in X selection is their problem and Emacs can't
> do anything about it.
Yes, but once in the X selection I'd like to see Emacs honor them.
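To make the windows-1252 reading mentioned above concrete, here is a small Python sketch of what a lenient browser might do with a numeric reference in the 128..159 range: treat the number as a windows-1252 byte instead of a Unicode code point. The function name is hypothetical, and whether galeon or lynx work exactly this way is an assumption; only Python's built-in "cp1252" codec is used.

```python
def c1_reference_as_cp1252(number):
    """Interpret a numeric character reference leniently.

    Numbers in 128..159 are decoded as a single windows-1252 byte;
    everything else is taken as a Unicode code point unchanged.
    """
    if not 128 <= number <= 159:
        return chr(number)
    try:
        return bytes([number]).decode("cp1252")
    except UnicodeDecodeError:
        # A few bytes (e.g. 129) are undefined in Python's cp1252 table;
        # fall back to the C1 control character in that case.
        return chr(number)


for n in (132, 147):
    ch = c1_reference_as_cp1252(n)
    print(f"&#{n}; -> U+{ord(ch):04X}")
```

Under this reading, 132 and 147 come out as U+201E and U+201C, the pair Kenichi found in the windows-125X tables.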
The spacing problem also occurs when I try to cut and paste from Markus
Kuhn's demo file
(http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt):
• ‚deutsche‘ „Anführungszeichen“
When I insert (C-x RET c utf-8 RET C-x C-f UTF-8-demo.txt RET), things
are correctly displayed (the characters are different):
• ‚deutsche‘ „Anführungszeichen“
Cutting and pasting both of these examples from Emacs (this mail buffer)
into a UTF-8 xterm doesn't work either; instead of the quotes I see "-1"
and garbage.
I hope the examples will go through.
--
ke@suse.de (work) / keichwa@gmx.net (home): |
http://www.suse.de/~ke/ | ,__o
Free Translation Project: | _-\_<,
http://www.iro.umontreal.ca/contrib/po/HTML/ | (*)/'(*)