From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Aidan Kehoe <kehoea@parhasard.net>
Newsgroups: gmane.emacs.devel
Subject: Re: [PATCH] Unicode Lisp reader escapes
Date: Sat, 6 May 2006 19:26:02 +0200
Message-ID: <17500.56362.991971.290576@parhasard.net>
References: <17491.34779.959316.484740@parhasard.net>
	<E1FaobM-0005qh-00@etlken> <ufyjsemrn.fsf@gnu.org>
	<E1Fb7ai-0002Yb-00@etlken> <uy7xjcx5s.fsf@gnu.org>
	<87odyfnqcj.fsf-monnier+emacs@gnu.org> <uvesnc6cw.fsf@gnu.org>
	<E1Fbedv-0002Cd-3W@fencepost.gnu.org>
	<17498.11949.75640.41779@parhasard.net>
	<E1Fc5cK-0005oc-91@fencepost.gnu.org>
	<17499.42393.597479.46003@parhasard.net>
	<17499.44569.825194.210709@parhasard.net>
	<E1FcNj3-0004zC-50@fencepost.gnu.org>
NNTP-Posting-Host: main.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Trace: sea.gmane.org 1146936289 4750 80.91.229.2 (6 May 2006 17:24:49 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Sat, 6 May 2006 17:24:49 +0000 (UTC)
Cc: emacs-devel@gnu.org
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Sat May 06 19:24:45 2006
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by ciao.gmane.org with esmtp (Exim 4.43)
	id 1FcQVy-0002L6-B3
	for ged-emacs-devel@m.gmane.org; Sat, 06 May 2006 19:24:23 +0200
Original-Received: from localhost ([127.0.0.1] helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1FcQVx-0006dH-FY
	for ged-emacs-devel@m.gmane.org; Sat, 06 May 2006 13:24:21 -0400
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1FcQVl-0006bh-VZ
	for emacs-devel@gnu.org; Sat, 06 May 2006 13:24:10 -0400
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1FcQVl-0006aW-4K
	for emacs-devel@gnu.org; Sat, 06 May 2006 13:24:09 -0400
Original-Received: from [199.232.76.173] (helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1FcQVk-0006a3-Ss
	for emacs-devel@gnu.org; Sat, 06 May 2006 13:24:08 -0400
Original-Received: from [66.111.49.30] (helo=icarus.asclepian.ie)
	by monty-python.gnu.org with esmtp (Exim 4.52)
	id 1FcQW9-00031V-7w; Sat, 06 May 2006 13:24:33 -0400
Original-Received: by icarus.asclepian.ie (Postfix, from userid 1003)
	id 1D1E78008E; Sat,  6 May 2006 18:24:05 +0100 (IST)
Original-To: rms@gnu.org
In-Reply-To: <E1FcNj3-0004zC-50@fencepost.gnu.org>
X-Mailer: VM 7.17 under 21.5  (beta26) "endive" (+CVS-20060429) XEmacs Lucid
X-NS5-file-as-sent: t
X-Echelon-distraction: Mena Ft. Meade FSF Monica Lewinsky militia Delta Force 
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:54010
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/54010>


 On the sixth of May, Richard Stallman wrote:

 >     All of these are incompatibilities on the Emacs Lisp side; except for
 >     the Unicode escapes, a C programmer can use any C escape desired in
 >     Emacs Lisp.
 > 
 > That being so, I think it is useful to keep that true, and implement
 > \u and \U in a way that is compatible with C.
 > 
 > We could install this now if someone writes changes for etc/NEWS and
 > the Lisp manual, as well as the code.

Okay. I’ve already signed papers; the patch below includes updates
to the NEWS file, the code and the Lisp manual.

One mostly open question, which the patch below takes a clear stand on, is
whether it is acceptable to call decode-char (which is implemented in Lisp)
from the Lisp reader. I share Stefan Monnier’s judgement on this:

“I'd vote to keep the code in elisp.  After all, it's there, it works, and
as mentioned: there's no evidence that the decoding time of \u escapes is
ever going to need to be fast.  And it'll become fast in Emacs-unicode
anyway, so it doesn't seem to be worth the trouble.”

I have no objection to implementing decode-char in C in general; it would
mean that handle_one_event in xterm.c could be made much more robust, for
example. Currently, Unicode keysyms are handled inconsistently with the
Unicode coding systems, and code points above #xFFFF are simply dropped;
it doesn’t even try to convert them to Emacs characters. But integrating
a C decode-char into Emacs for the sake of this patch seems like too much
potential instability for too little benefit.

Another thing: if this patch is to be integrated, there is some Lisp in the
source tree using \u in strings (incorrectly) that will need to be changed
to use \\u.
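
For anyone who wants to try the digit handling outside Emacs, here is a
minimal standalone sketch of the parsing rule the patch implements in
read_escape: exactly four hex digits after \u, exactly eight after \U,
tested with explicit ASCII ranges as the patch does. The function name
parse_unicode_escape and its signature are hypothetical, made up for
illustration; the sketch only produces the raw code point and does not
attempt the decode-char step.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the patch: read exactly COUNT hex
   digits from S and store the resulting code point in *OUT.
   Returns 0 on success, -1 if a non-hex character (including the
   terminating NUL) is seen before COUNT digits have been read.  */
int
parse_unicode_escape (const char *s, int count, unsigned long *out)
{
  unsigned long value = 0;
  int n;

  if (s == NULL || out == NULL)
    return -1;

  for (n = 0; n < count; n++)
    {
      int c = (unsigned char) s[n];

      /* Explicit ASCII ranges, mirroring the patch, instead of
         <ctype.h> calls.  */
      if (c >= '0' && c <= '9')
        value = (value << 4) + (unsigned long) (c - '0');
      else if (c >= 'a' && c <= 'f')
        value = (value << 4) + (unsigned long) (c - 'a') + 10;
      else if (c >= 'A' && c <= 'F')
        value = (value << 4) + (unsigned long) (c - 'A') + 10;
      else
        return -1;   /* too few digits or a stray character */
    }

  *out = value;
  return 0;
}
```

With this sketch, parse_unicode_escape ("0428", 4, &cp) leaves #x428 in cp,
parse_unicode_escape ("0001D6E2", 8, &cp) leaves #x1D6E2, and an input such
as "12G4" is rejected instead of being silently truncated.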

etc/ChangeLog addition:

2006-05-06  Aidan Kehoe  <kehoea@parhasard.net>

	* NEWS:
	Describe the Unicode string and character escape syntax.

lispref/ChangeLog addition:

2006-05-06  Aidan Kehoe  <kehoea@parhasard.net>

	* objects.texi (Character Type):
        Describe the Unicode character escape syntax; \uABCD or \U00ABCDEF
        specifies Unicode characters U+ABCD and U+ABCDEF respectively.


src/ChangeLog addition:

2006-05-06  Aidan Kehoe  <kehoea@parhasard.net>

	* lread.c (read_escape):
        Provide a Unicode character escape syntax; \u followed by exactly
        four or \U followed by exactly eight hex digits in a character or
        string is read as a Unicode character with that code point.

GNU Emacs Trunk source patch:
Diff command:   cvs -q diff -u
Files affected: src/lread.c lispref/objects.texi etc/NEWS

Index: etc/NEWS
===================================================================
RCS file: /sources/emacs/emacs/etc/NEWS,v
retrieving revision 1.1337
diff -u -u -r1.1337 NEWS
--- etc/NEWS	2 May 2006 01:47:57 -0000	1.1337
+++ etc/NEWS	6 May 2006 16:57:54 -0000
@@ -3772,6 +3772,13 @@
 been declared obsolete.
 
 +++
+*** New syntax: \uXXXX and \UXXXXXXXX specify Unicode code points in hex.
+Use "\u0428" to specify a string consisting of CYRILLIC CAPITAL LETTER SHA,
+or "\U0001D6E2" to specify one consisting of MATHEMATICAL ITALIC CAPITAL
+ALPHA (the latter is greater than #xFFFF and thus needs the longer
+syntax). Also available for characters.
+
++++
 ** Displaying warnings to the user.
 
 See the functions `warn' and `display-warning', or the Lisp Manual.
Index: lispref/objects.texi
===================================================================
RCS file: /sources/emacs/emacs/lispref/objects.texi,v
retrieving revision 1.53
diff -u -u -r1.53 objects.texi
--- lispref/objects.texi	1 May 2006 15:05:48 -0000	1.53
+++ lispref/objects.texi	6 May 2006 16:57:56 -0000
@@ -431,6 +431,20 @@
 bit values are 2**22 for alt, 2**23 for super and 2**24 for hyper.
 @end ifnottex
 
+@cindex unicode character escape
+  Emacs provides a syntax for specifying characters by their Unicode code
+points.  @code{?\uABCD} represents a character that maps to the code
+point @samp{U+ABCD} in Unicode-based representations (UTF-8 text files,
+Unicode-oriented fonts, etc.).  There is a slightly different syntax for
+specifying characters with code points above @code{#xFFFF};
+@code{?\U00ABCDEF} represents an Emacs character that maps to the code
+point @samp{U+ABCDEF} in Unicode-based representations, if such an Emacs
+character exists.
+
+  Unlike in some other languages, while this syntax is available for
+character literals, and (see later) in strings, it is not available
+elsewhere in your Lisp source code.
+
 @cindex @samp{\} in character constant
 @cindex backslash in character constant
 @cindex octal character code
Index: src/lread.c
===================================================================
RCS file: /sources/emacs/emacs/src/lread.c,v
retrieving revision 1.350
diff -u -u -r1.350 lread.c
--- src/lread.c	27 Feb 2006 02:04:35 -0000	1.350
+++ src/lread.c	6 May 2006 16:57:57 -0000
@@ -1743,6 +1743,9 @@
      int *byterep;
 {
   register int c = READCHAR;
+  /* \u takes exactly four hex digits, \U exactly eight.  Default to the
+     behaviour for \u, and change this value in the case that \U is seen. */
+  int unicode_hex_count = 4;
 
   *byterep = 0;
 
@@ -1907,6 +1910,48 @@
 	return i;
       }
 
+    case 'U':
+      /* Post-Unicode-2.0: exactly eight hex chars; falls through to 'u'. */
+      unicode_hex_count = 8;
+    case 'u':
+
+      /* A Unicode escape. We only permit them in strings and characters,
+	 not arbitrarily in the source code as in some other languages. */
+      {
+	int i = 0;
+	int count = 0;
+	Lisp_Object lisp_char;
+	while (++count <= unicode_hex_count)
+	  {
+	    c = READCHAR;
+	    /* isdigit(), isalpha() may be locale-specific, which we don't
+	       want. */
+	    if      (c >= '0' && c <= '9')  i = (i << 4) + (c - '0');
+	    else if (c >= 'a' && c <= 'f')  i = (i << 4) + (c - 'a') + 10;
+	    else if (c >= 'A' && c <= 'F')  i = (i << 4) + (c - 'A') + 10;
+	    else
+	      {
+		error ("Non-hex digit used for Unicode escape");
+		break;
+	      }
+	  }
+
+	lisp_char = call2(intern("decode-char"), intern("ucs"),
+			  make_number(i));
+
+	if (EQ(Qnil, lisp_char))
+	  {
+	    /* This is ugly and horrible and trashes the user's data. */
+	    XSETFASTINT (i, MAKE_CHAR (charset_katakana_jisx0201,
+				       34 + 128, 46 + 128));
+            return i;
+	  }
+	else
+	  {
+	    return XFASTINT (lisp_char);
+	  }
+      }
+
     default:
       if (BASE_LEADING_CODE_P (c))
 	c =3D read_multibyte (c, readcharfun);

-- 
Aidan Kehoe, http://www.parhasard.net/