From: Aidan Kehoe
Newsgroups: gmane.emacs.devel,gmane.emacs.pretest.bugs
Subject: Re: [PATCH] Unicode Lisp reader escapes.
Date: Thu, 15 Jun 2006 20:38:06 +0200
Message-ID: <17553.43278.718379.863167@parhasard.net>
References: <17491.34779.959316.484740@parhasard.net> <17492.29148.246942.842300@parhasard.net> <8764kkawsf.fsf@jurta.org> <87vesi6nh1.fsf@jurta.org> <878xp8g2a9.fsf@jurta.org> <17537.54719.354843.89030@parhasard.net>
To: Eli Zaretskii
Cc: emacs-pretest-bug@gnu.org, emacs-devel@gnu.org

 > if (EQ(Qnil, lisp_char))
 >   {
 >     /* This is ugly and horrible and trashes the user's data. */
 >     XSETFASTINT (i, MAKE_CHAR (charset_katakana_jisx0201,
 >                                34 + 128, 46 + 128));
 >     return i;
 >   }
 >
 > What is this special Katakana character, and why are we producing it?

Firstly, thank you for posing the question; the character intended was not a
member of JISX0201 at all, rather of JISX0208. I yanked the wrong charset
identifier from charset.h when porting the code from XEmacs. The patch below
addresses this.

(make-char 'japanese-jisx0208 34 46) gives U+3013 GETA MARK, a character in
JISX 0208 that is used to represent unknown or corrupted data. The
Unicode-specific equivalent is U+FFFD REPLACEMENT CHARACTER. I used the GETA
MARK because I was certain it would be available in Mule and it is
equivalent. It turns out that (make-char 'mule-unicode-e000-ffff 117 61)
gives U+FFFD, so it might be worthwhile to replace that.

 > Is it to trigger an "Invalid character" message, or is something else
 > going on here?

It doesn’t actually trigger a message, it displays a character to be
interpreted as “the character couldn’t be interpreted.” My feeling is that
the syntax should be close in its behaviour to what the coding systems do,
and when the coding systems see a code point that is valid but that they
can’t interpret, they trash the user’s data. (Or do something totally mad
like transform invalid UTF-16 to invalid UTF-8!?)

src/ChangeLog addition:

2006-06-14  Aidan Kehoe

        * lread.c (read_escape): Change charset_katakana_jisx0201 to
        charset_jisx0208 as it should have been in the first place, since
        we intended U+3013 GETA MARK.

GNU Emacs Trunk source patch:
Diff command:   cvs -q diff -u
Files affected: src/lread.c

Index: src/lread.c
===================================================================
RCS file: /sources/emacs/emacs/src/lread.c,v
retrieving revision 1.353
diff -u -u -r1.353 lread.c
--- src/lread.c 9 Jun 2006 18:22:30 -0000 1.353
+++ src/lread.c 14 Jun 2006 06:57:49 -0000
@@ -1967,7 +1967,7 @@
       if (EQ(Qnil, lisp_char))
         {
           /* This is ugly and horrible and trashes the user's data. */
-          XSETFASTINT (i, MAKE_CHAR (charset_katakana_jisx0201,
+          XSETFASTINT (i, MAKE_CHAR (charset_jisx0208,
                                      34 + 128, 46 + 128));
           return i;
         }

-- 
Aidan Kehoe, http://www.parhasard.net/
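
A note on the arithmetic behind the position codes in this message, as a
minimal sketch in C. The helper names are hypothetical and do not exist in
the Emacs sources. The second helper assumes that mule-unicode-e000-ffff is a
96x96 charset whose position codes start at 32 and whose first character is
U+E000; that assumption is what makes (117, 61) correspond to U+FFFD. The
first helper only mirrors the "34 + 128, 46 + 128" arguments in the patch and
says nothing about what MAKE_CHAR does with them.

```c
#include <assert.h>

/* Hypothetical helpers for illustration only; none of these names
   exist in the Emacs sources.  */

/* Set the high bit of a 7-bit JIS X 0208 position code, as the
   "34 + 128, 46 + 128" arguments in the patch do.  */
unsigned
position_code_with_high_bit (unsigned code)
{
  assert (code >= 33 && code <= 126);   /* 94x94 charset range */
  return code + 128;
}

/* Map a pair of position codes in mule-unicode-e000-ffff to a Unicode
   code point, ASSUMING a 96x96 layout whose codes start at 32 and
   whose first character is U+E000.  */
unsigned
mule_unicode_e000_ffff_to_ucs (unsigned c1, unsigned c2)
{
  assert (c1 >= 32 && c1 <= 127);
  assert (c2 >= 32 && c2 <= 127);
  return 0xE000 + (c1 - 32) * 96 + (c2 - 32);
}
```

Under that assumption, mule_unicode_e000_ffff_to_ucs (117, 61) works out to
0xE000 + 85 * 96 + 29 = 0xFFFD, and the high-bit forms of 34 and 46 come to
0xA2 and 0xAE.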