unofficial mirror of emacs-devel@gnu.org 
From: Karl Eichwalder <keichwa@gmx.net>
Cc: eliz@is.elta.co.il, emacs-devel@gnu.org, Andreas Schwab <schwab@suse.de>
Subject: Re: Reporting UTF-8 related problems?
Date: Tue, 30 Jul 2002 20:58:32 +0200
Message-ID: <sh65yxniuf.fsf@tux.gnu.franken.de>
In-Reply-To: <200207300711.QAA05993@etlken.m17n.org> (Kenichi Handa's message of "Tue, 30 Jul 2002 16:11:18 +0900 (JST)")

Kenichi Handa <handa@etl.go.jp> writes:

> 	&#132;Die Familie Schroffenstein&#147
>
> I thought that the notation &#NUMBER is for transmitting
> Unicode character of code NUMBER.  But, 132 and 147 are
> control codes in Unicode, not any kind of quotings.

&#NUMBERs are so-called "character references"; the SGML declaration
defines which ones are allowed.  For HTML you must consult the
html.d[e]?cl file.  The crucial section is (HTML 2):

     BASESET   "ISO Registration Number 100//CHARSET
                ECMA-94 Right Part of
                Latin Alphabet Nr. 1//ESC 2/13 4/1"

         DESCSET  128  32   UNUSED
                  160  96    32

This basically means: &#128 to &#159 are unused.  The same applies to
HTML 4 (and later to XML and XHTML, respectively):

          BASESET  "ISO Registration Number 177//CHARSET
                    ISO/IEC 10646-1:1993 UCS-4 with
                    implementation level 3//ESC 2/5 2/15 4/6"
         DESCSET 0       9       UNUSED
                 9       2       9
                 11      2       UNUSED
                 13      1       13
                 14      18      UNUSED
                 32      95      32
                 127     1       UNUSED
                 128     32      UNUSED
                 [...]
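
To check a given code against such a declaration, a small sketch in
Python may help.  The table is copied from the HTML 4 rows quoted above
(without the "[...]" tail); the names HTML4_DESCSET and is_unused are
made up for this illustration and are not part of any SGML tool:

```python
# Each DESCSET row is (first code, count, meaning); None stands for UNUSED.
# Rows copied from the HTML 4 declaration quoted in this mail.
HTML4_DESCSET = [
    (0, 9, None),
    (9, 2, 9),
    (11, 2, None),
    (13, 1, 13),
    (14, 18, None),
    (32, 95, 32),
    (127, 1, None),
    (128, 32, None),
]


def is_unused(code, descset=HTML4_DESCSET):
    """Return True if the declaration marks `code` as UNUSED.

    Raises LookupError for codes outside the rows listed here.
    """
    for first, count, meaning in descset:
        if first <= code < first + count:
            return meaning is None
    raise LookupError(f"code {code} is not covered by the listed rows")


for code in (65, 132, 147):
    print(code, "UNUSED" if is_unused(code) else "allowed")
```

Both 132 and 147 fall into the row "128 32 UNUSED", which is why a
validating parser rejects the references.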

To make the SGML parser happy you can provide a changed declaration:

          BASESET  "ISO Registration Number 177//CHARSET
                    ISO/IEC 10646-1:1993 UCS-4 with
                    implementation level 3//ESC 2/5 2/15 4/6"
         DESCSET 0       9       UNUSED
                 9       2       9
                 11      2       UNUSED
                 13      1       13
                 14      18      UNUSED
                 32      95      32
                 127     1       UNUSED
                 128     4      UNUSED
                 132     1      "My rising double quote left (low)"
                 133     14     UNUSED
                 147     1      "My rising double quote right (high)"
                 148     16     UNUSED
                 [...]

Untested, and the result is invalid HTML.  If they announced a
proper HTTP header, it could be okay:

Content-Type: text/html; charset=windows-1252


Andreas Schwab <schwab@suse.de> writes:

> The numbers are supposed to be ISO 8859-1 characters codes.  I'd guess the
> page has been written with some broken (a.k.a. W*nd*ws) software (the use
> of *.htm makes this apparent).

Yes, they have "interesting" guidelines online...

Kenichi Handa <handa@etl.go.jp> writes:

> Ah, I see.  I found that windows-125X maps 132 and 147 to
> U+201E and U+201C.  So, perhaps those systems (galeon and
> lynx) parse them as U+201E and U+201C.  Anyway, how to
> encode them in X selection is their problem and Emacs can't
> do anything about it.

Yes, but once in the X selection I'd like to see Emacs honor them.
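
The mapping is easy to confirm.  The following Python sketch assumes
that the lenient browsers simply reinterpret the codes 128 to 159 as
windows-1252 bytes; that is a guess about their behaviour, and the
function name resolve_charref is invented for this example:

```python
def resolve_charref(number):
    """Resolve a numeric character reference the lenient way.

    Codes 128-159 are reinterpreted as windows-1252 bytes; everything
    else is taken as a Unicode code point.  This models the assumed
    browser behaviour, not what the SGML declaration allows.
    """
    if 128 <= number <= 159:
        # errors="replace" covers the few bytes windows-1252 leaves undefined
        return bytes([number]).decode("cp1252", errors="replace")
    return chr(number)


for n in (132, 147):
    ch = resolve_charref(n)
    print(f"&#{n}; -> U+{ord(ch):04X} {ch}")
```

Read as ISO 8859-1 instead, the same two bytes are C1 control
characters, which matches Handa's first observation.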

The spacing problem also occurs when I try to cut and paste from Markus
Kuhn's demo file
(http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt):

• ‚deutsche „Anführungszeichen“

When I open the file (C-x RET c utf-8 RET C-x C-f UTF-8-demo.txt RET),
things are displayed correctly (the characters are different):

• ‚deutsche‘ „Anführungszeichen“

Cutting and pasting both of these examples from Emacs (this mail
buffer) to a UTF-8 xterm doesn't work either; instead of the quotes I
see "-1" and garbage.
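
To see which characters actually arrive after such a paste, it may help
to dump the code points.  A minimal sketch in Python; the helper name
describe is made up, and the sample string is just the four quotation
marks from the demo line:

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' string per character of `text`."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


# The quotation marks used in the demo line, written as escapes so the
# example does not depend on the encoding of this mail.
for line in describe("\u201e\u201c\u201a\u2018"):
    print(line)
```

Running the same helper on the pasted text and on the text read from
the file should show where the two differ.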

I hope the examples will go through.

-- 
ke@suse.de (work) / keichwa@gmx.net (home):              |
http://www.suse.de/~ke/                                  |      ,__o
Free Translation Project:                                |    _-\_<,
http://www.iro.umontreal.ca/contrib/po/HTML/             |   (*)/'(*)
