From: Karl Eichwalder
Newsgroups: gmane.emacs.devel
Subject: Re: Reporting UTF-8 related problems?
Date: Tue, 30 Jul 2002 20:58:32 +0200
Original-To: Kenichi Handa
Cc: eliz@is.elta.co.il, emacs-devel@gnu.org, Andreas Schwab
In-Reply-To: <200207300711.QAA05993@etlken.m17n.org> (Kenichi Handa's message of "Tue, 30 Jul 2002 16:11:18 +0900 (JST)")
User-Agent: Gnus/5.090006 (Oort Gnus v0.06) Emacs/21.3.50 (i686-pc-linux-gnu)

Kenichi Handa writes:

>   &#132;Die Familie Schroffenstein&#147;
>
> I thought that the notation &#NUMBER is for transmitting
> Unicode character of code NUMBER.  But, 132 and 147 are
> control codes in Unicode, not any kind of quotings.

&#NUMBERs are so-called "character references"; the SGML declaration
defines which are allowed.  For HTML you must consult the html.d[e]?cl
file.  The crucial section is (HTML 2):

    BASESET  "ISO Registration Number 100//CHARSET
              ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1"
    DESCSET  128  32  UNUSED
             160  96  32

This basically means: &#128; to &#159; are unused.  The same applies to
HTML 4 (and later to XML and XHTML):

    BASESET  "ISO Registration Number 177//CHARSET
              ISO/IEC 10646-1:1993 UCS-4 with
              implementation level 3//ESC 2/5 2/15 4/6"
    DESCSET    0   9  UNUSED
               9   2  9
              11   2  UNUSED
              13   1  13
              14  18  UNUSED
              32  95  32
             127   1  UNUSED
             128  32  UNUSED
             [...]

To make the SGML parser happy you can provide a changed declaration:

    BASESET  "ISO Registration Number 177//CHARSET
              ISO/IEC 10646-1:1993 UCS-4 with
              implementation level 3//ESC 2/5 2/15 4/6"
    DESCSET    0   9  UNUSED
               9   2  9
              11   2  UNUSED
              13   1  13
              14  18  UNUSED
              32  95  32
             127   1  UNUSED
             128   4  UNUSED
             132   1  "My rising double quote left (low)"
             133  14  UNUSED
             147   1  "My rising double quote right (high)"
             148  12  UNUSED
             [...]

Untested, and the result is invalid HTML.  If they announced a proper
HTTP header, it could be okay:

    Content-Type: text/html; charset=windows-1252

Andreas Schwab writes:

> The numbers are supposed to be ISO 8859-1 character codes.
> I'd guess the page has been written with some broken (a.k.a. W*nd*ws)
> software (the use of *.htm makes this apparent).

Yes, they have "interesting" guidelines online...

Kenichi Handa writes:

> Ah, I see.  I found that windows-125X maps 132 and 147 to
> U+201E and U+201C.  So, perhaps those systems (galeon and
> lynx) parse them as U+201E and U+201C.  Anyway, how to
> encode them in X selection is their problem and Emacs can't
> do anything about it.

Yes, but once in the X selection I'd like to see Emacs honor them.

The spacing problem also occurs when I try to cut and paste from Markus
Kuhn's demo file
(http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt):

    • ‚deutsche‘ „Anführungszeichen“

[MIME parts as sent: "• ‚deutsche" and "„Anf" in charset=macintosh;
"‘ " and "“" in charset=euc-jp; "ührungszeichen" in charset=iso-8859-1]

When I insert the file (C-x RET c utf-8 RET C-x C-f UTF-8-demo.txt RET),
things are correctly displayed (the characters are different):

    • ‚deutsche‘ „Anführungszeichen“

[MIME parts as sent: "• ‚deutsche‘ „Anf" and "“" in charset=macintosh;
"ührungszeichen" in charset=iso-8859-1]

Cutting and pasting both these examples from Emacs (this mail buffer) to
a UTF-8 xterm doesn't work either; instead of the quotes I see "-1" and
garbage.

I hope the examples will go through.

-- 
ke@suse.de (work) / keichwa@gmx.net (home):              |
http://www.suse.de/~ke/                                  |      ,__o
Free Translation Project:                                |    _-\_<,
http://www.iro.umontreal.ca/contrib/po/HTML/             |   (*)/'(*)
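
The two readings of the character references 132 and 147 discussed in this message can be compared with a minimal Python sketch. It is only an illustration, not what galeon, lynx or Emacs actually do: the function names `charref_as_unicode` and `charref_as_cp1252` are made up for this example, and "lenient" here simply means treating the number as a single windows-1252 byte.

```python
import unicodedata


def charref_as_unicode(number: int) -> str:
    """Strict reading: &#NUMBER; is the Unicode code point NUMBER."""
    return chr(number)


def charref_as_cp1252(number: int) -> str:
    """Lenient reading (an assumption): NUMBER is one windows-1252 byte."""
    return bytes([number]).decode("cp1252")


for number in (132, 147):
    strict = charref_as_unicode(number)
    lenient = charref_as_cp1252(number)
    # Category "Cc" marks a control character; the lenient reading
    # yields a printable quotation mark instead.
    print(
        f"&#{number};  strict: U+{ord(strict):04X} "
        f"(category {unicodedata.category(strict)})  "
        f"lenient: U+{ord(lenient):04X} ({unicodedata.name(lenient)})"
    )
```

Under the strict reading both numbers fall in the C1 control range that the SGML declaration marks UNUSED; under the windows-1252 reading they come out as the low and high double quotes that Kenichi Handa found in the windows-125X tables.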