From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Aidan Kehoe <kehoea@parhasard.net>
Newsgroups: gmane.emacs.devel
Subject: Re: [PATCH] Unicode Lisp reader escapes
Date: Sun, 30 Apr 2006 10:14:20 +0200
Message-ID: <17492.29148.246942.842300@parhasard.net>
References: <17491.34779.959316.484740@parhasard.net> 
	<E1Fa2EV-0003pC-1U@fencepost.gnu.org>
NNTP-Posting-Host: main.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Cc: emacs-devel@gnu.org
Original-To: rms@gnu.org
In-Reply-To: <E1Fa2EV-0003pC-1U@fencepost.gnu.org> 
X-Mailer: VM 7.17 under 21.5 (beta25) "eggplant" (+CVS-20060325) XEmacs Lucid
Xref: news.gmane.org gmane.emacs.devel:53645
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/53645>

 On the twenty-ninth of April, Richard Stallman wrote:

 > [Comments on the text taken into account in the revised patch below.]
 >
 > [...]
 >
 > What is the reason for needing both \u and \U, and the difference? Why
 > not use a syntax like that of \x?

Both are fixed-length escapes, which is a good thing: people get into the
habit of typing "\u0123As I walked out one evening" instead of the more
disastrous "\u123As I walked out one evening", where a variable-length
reader would swallow the "A" as a fourth hex digit and produce U+123A
followed by "s I walked out one evening". We could provide the same
functionality with the \U00ABCDEF syntax alone, but code points above
#xFFFF are very rarely used, so having to type four leading zeroes (as in
\U0000ABCD) would be annoying most of the time.

The escapes do not take a variable number of digits, as \x does, precisely
because of the "\u0123As I" versus "\u123As I" problem above.
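
A minimal sketch of that difference in standalone C, not taken from the
patch. The function names read_fixed_hex and read_greedy_hex are
hypothetical and only illustrate the two reading strategies; neither is
Emacs code.

```c
#include <stddef.h>

/* Value of hex digit C, or -1 if C is not one.  Plain range checks,
   as in the patch, rather than <ctype.h>.  */
int
hex_value (int c)
{
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

/* Hypothetical fixed-length reader (the \u and \U style): consume
   exactly NDIGITS hex digits from S.  Returns the code point and
   stores the number of characters consumed in *USED, or returns -1
   if a non-hex character (or the end of the string) comes first.  */
long
read_fixed_hex (const char *s, int ndigits, size_t *used)
{
  long value = 0;
  int n;

  for (n = 0; n < ndigits; n++)
    {
      int d = hex_value ((unsigned char) s[n]);
      if (d < 0)
        return -1;
      value = (value << 4) + d;
    }
  *used = (size_t) ndigits;
  return value;
}

/* Hypothetical variable-length reader (the \x style): consume hex
   digits for as long as they keep coming.  */
long
read_greedy_hex (const char *s, size_t *used)
{
  long value = 0;
  size_t n = 0;
  int d;

  while ((d = hex_value ((unsigned char) s[n])) >= 0)
    {
      value = (value << 4) + d;
      n++;
    }
  *used = n;
  return value;
}
```

Given "0123As I walked out one evening", read_fixed_hex with NDIGITS 4
yields #x0123 and leaves "As I walked out one evening" untouched. Given
"123As I walked out one evening", read_greedy_hex yields #x123A and the
text resumes at "s I walked out one evening".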

lispref/ChangeLog addition:

2006-04-30  Aidan Kehoe  <kehoea@parhasard.net>

	* objects.texi (Character Type): Describe the Unicode character
	escape syntax; \uABCD and \U00ABCDEF specify the Unicode
	characters U+ABCD and U+ABCDEF respectively.

src/ChangeLog addition:

2006-04-30  Aidan Kehoe  <kehoea@parhasard.net>

	* lread.c (read_escape): Provide a Unicode character escape
	syntax; \u followed by exactly four, or \U followed by exactly
	eight, hex digits in a character constant or string is read as
	the Unicode character with that code point.

GNU Emacs Trunk source patch:
Diff command:   cvs -q diff -u
Files affected: src/lread.c lispref/objects.texi

Index: lispref/objects.texi
===================================================================
RCS file: /sources/emacs/emacs/lispref/objects.texi,v
retrieving revision 1.51
diff -u -u -r1.51 objects.texi
--- lispref/objects.texi	6 Feb 2006 11:55:10 -0000	1.51
+++ lispref/objects.texi	30 Apr 2006 08:08:05 -0000
@@ -431,6 +431,20 @@
 bit values are 2**22 for alt, 2**23 for super and 2**24 for hyper.
 @end ifnottex
 
+@cindex Unicode character escape
+  Emacs provides a syntax for specifying characters by their Unicode code
+points.  @code{?\uABCD} represents a character that maps to the code
+point @samp{U+ABCD} in Unicode-based representations (UTF-8 text files,
+Unicode-oriented fonts, etc.).  There is a slightly different syntax for
+specifying characters with code points above @code{#xFFFF};
+@code{?\U00ABCDEF} represents an Emacs character that maps to the code
+point @samp{U+ABCDEF} in Unicode-based representations, if such an Emacs
+character exists.
+
+  Unlike in some other languages, this syntax is available only in
+character literals and (as described later) in strings; it is not
+recognized elsewhere in your Lisp source code.
+
 @cindex @samp{\} in character constant
 @cindex backslash in character constant
 @cindex octal character code
Index: src/lread.c
===================================================================
RCS file: /sources/emacs/emacs/src/lread.c,v
retrieving revision 1.350
diff -u -u -r1.350 lread.c
--- src/lread.c	27 Feb 2006 02:04:35 -0000	1.350
+++ src/lread.c	30 Apr 2006 08:08:07 -0000
@@ -1743,6 +1743,9 @@
      int *byterep;
 {
   register int c = READCHAR;
+  /* \u takes exactly four hex digits, \U exactly eight.  Default to the
+     behaviour for \u, and change this value in the case that \U is seen.  */
+  int unicode_hex_count = 4;
 
   *byterep = 0;
 
@@ -1907,6 +1910,48 @@
 	return i;
       }
 
+    case 'U':
+      /* Post-Unicode-2.0: exactly eight hex digits; falls through to \u.  */
+      unicode_hex_count = 8;
+    case 'u':
+
+      /* A Unicode escape.  We only permit them in strings and characters,
+	 not arbitrarily in the source code as in some other languages.  */
+      {
+	int i = 0;
+	int count = 0;
+	Lisp_Object lisp_char;
+	while (++count <= unicode_hex_count)
+	  {
+	    c = READCHAR;
+	    /* isdigit(), isalpha() may be locale-specific, which we don't
+	       want. */
+	    if      (c >= '0' && c <= '9')  i = (i << 4) + (c - '0');
+	    else if (c >= 'a' && c <= 'f')  i = (i << 4) + (c - 'a') + 10;
+	    else if (c >= 'A' && c <= 'F')  i = (i << 4) + (c - 'A') + 10;
+	    else
+	      {
+		error ("Non-hex digit used for Unicode escape");
+		break;
+	      }
+	  }
+
+	lisp_char = call2 (intern ("decode-char"), intern ("ucs"),
+			   make_number (i));
+
+	if (EQ (Qnil, lisp_char))
+	  {
+	    /* This is ugly and horrible and trashes the user's data. */
+	    XSETFASTINT (i, MAKE_CHAR (charset_katakana_jisx0201,
+				       34 + 128, 46 + 128));
+	    return i;
+	  }
+	else
+	  {
+	    return XFASTINT (lisp_char);
+	  }
+      }
+
     default:
       if (BASE_LEADING_CODE_P (c))
 	c = read_multibyte (c, readcharfun);
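
The documentation text in the patch speaks of a character "mapping to"
a code point in Unicode-based representations such as UTF-8 text files.
As an illustration only, here is a minimal sketch of what that mapping
looks like at the byte level. utf8_encode is a hypothetical standalone
helper; it is not part of the patch and says nothing about how Emacs
stores characters internally.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the patch: encode code point CP
   as UTF-8 into BUF, which must hold at least 4 bytes.  Returns the
   number of bytes written, or 0 if CP is a surrogate or lies above
   U+10FFFF, the upper limit of Unicode.  */
size_t
utf8_encode (unsigned long cp, unsigned char *buf)
{
  if (cp < 0x80)
    {
      buf[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      buf[0] = (unsigned char) (0xC0 | (cp >> 6));
      buf[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return 0;			/* Surrogates are not characters.  */
  if (cp < 0x10000)
    {
      buf[0] = (unsigned char) (0xE0 | (cp >> 12));
      buf[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      buf[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp <= 0x10FFFF)
    {
      buf[0] = (unsigned char) (0xF0 | (cp >> 18));
      buf[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      buf[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      buf[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;			/* Beyond the Unicode range.  */
}
```

For #xABCD this writes the three bytes EA AF 8D. For #xABCDEF, the
placeholder value used in the documentation, it returns 0, because that
value lies above U+10FFFF; this is the kind of case the "if such an
Emacs character exists" wording has to allow for.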

-- 
In the beginning God created the heavens and the earth. And God was a
bug-eyed, hexagonal smurf with a head of electrified hair; and God said:
“Si, mi chiamano Mimi...”