Kenichi Handa <handa@etl.go.jp> writes:

> 	&#132;Die Familie Schroffenstein&#147
>
> I thought that the notation &#NUMBER is for transmitting
> Unicode character of code NUMBER.  But, 132 and 147 are
> control codes in Unicode, not any kind of quotings.

&#NUMBERs are so-called "character references"; the SGML declaration
defines which ones are allowed.  For HTML you must consult the html.d[e]?cl
file.  The crucial section is (HTML 2):

     BASESET   "ISO Registration Number 100//CHARSET
                ECMA-94 Right Part of
                Latin Alphabet Nr. 1//ESC 2/13 4/1"

         DESCSET  128  32   UNUSED
                  160  96    32

This basically means: &#128 to &#159 are unused.  The same applies to
HTML 4 (and later to XML and XHTML):

          BASESET  "ISO Registration Number 177//CHARSET
                    ISO/IEC 10646-1:1993 UCS-4 with
                    implementation level 3//ESC 2/5 2/15 4/6"
         DESCSET 0       9       UNUSED
                 9       2       9
                 11      2       UNUSED
                 13      1       13
                 14      18      UNUSED
                 32      95      32
                 127     1       UNUSED
                 128     32      UNUSED
                 [...]
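
As a rough illustration of how to read such a DESCSET, here is a minimal
Python sketch.  It only models the rows quoted above (the part elided by
"[...]" is left out on purpose), and the names DESCSET and lookup are made
up for this example; it is not how an SGML parser is implemented.

```python
# Rows copied from the HTML 4 DESCSET quoted above:
# (first code, number of codes, first code in the base set).
# None stands for UNUSED.  Rows hidden behind "[...]" are not modelled.
DESCSET = [
    (0, 9, None),
    (9, 2, 9),
    (11, 2, None),
    (13, 1, 13),
    (14, 18, None),
    (32, 95, 32),
    (127, 1, None),
    (128, 32, None),
]


def lookup(code):
    """Return 'unused', the mapped base-set code, or 'not covered'."""
    for first, count, target in DESCSET:
        if first <= code < first + count:
            if target is None:
                return "unused"
            # Codes in a mapped row keep their offset within the row.
            return target + (code - first)
    return "not covered"


for code in (65, 132, 147):
    print(code, lookup(code))
```

With these rows, both 132 and 147 fall into the "128 32 UNUSED" row, which
is why a validating parser rejects &#132; and &#147;.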

To make the SGML parser happy you can provide a changed declaration:

          BASESET  "ISO Registration Number 177//CHARSET
                    ISO/IEC 10646-1:1993 UCS-4 with
                    implementation level 3//ESC 2/5 2/15 4/6"
         DESCSET 0       9       UNUSED
                 9       2       9
                 11      2       UNUSED
                 13      1       13
                 14      18      UNUSED
                 32      95      32
                 127     1       UNUSED
                 128     4       UNUSED
                 132     1       "My rising double quote left (low)"
                 133     14      UNUSED
                 147     1       "My rising double quote right (high)"
                 148     12      UNUSED
                 [...]

Untested, and the result is invalid HTML.  If they sent a proper HTTP
header, it could be okay:

Content-Type: text/html; charset=windows-1252


Andreas Schwab <schwab@suse.de> writes:

> The numbers are supposed to be ISO 8859-1 characters codes.  I'd guess the
> page has been written with some broken (a.k.a. W*nd*ws) software (the use
> of *.htm makes this apparent).

Yes, they have "interesting" guidelines online...

Kenichi Handa <handa@etl.go.jp> writes:

> Ah, I see.  I found that windows-125X maps 132 and 147 to
> U+201E and U+201C.  So, perhaps those systems (galeon and
> lynx) parse them as U+201E and U+201C.  Anyway, how to
> encode them in X selection is their problem and Emacs can't
> do anything about it.

Yes, but once in the X selection I'd like to see Emacs honor them.
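
For what it's worth, the windows-1252 reading of these two references is
easy to reproduce.  The following Python sketch is only an assumption about
what such browsers do (I have not checked their sources), and fix_charrefs
is a name invented for this example.  It reinterprets &#128 to &#159 as
windows-1252 bytes and leaves every other reference untouched:

```python
import re


def fix_charrefs(text):
    """Reinterpret &#128;..&#159; as windows-1252 bytes; leave the rest alone."""

    def repl(match):
        code = int(match.group(1))
        if 128 <= code <= 159:
            try:
                return bytes([code]).decode("cp1252")
            except UnicodeDecodeError:
                # Some positions in this range are undefined in windows-1252.
                return match.group(0)
        return match.group(0)

    # The trailing semicolon is optional, as in the page quoted above.
    return re.sub(r"&#(\d+);?", repl, text)


fixed = fix_charrefs("&#132;Die Familie Schroffenstein&#147")
print([hex(ord(ch)) for ch in (fixed[0], fixed[-1])])
```

The first and last characters of the result are U+201E and U+201C, which
matches the mapping Handa-san found.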

The spacing problem also occurs when I try to cut and paste from Markus
Kuhn's demo file
(http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt):

Ą âdeutsche