From: Aidan Kehoe
Newsgroups: gmane.emacs.devel
Subject: Re: [PATCH] Unicode Lisp reader escapes
Date: Sun, 30 Apr 2006 10:14:20 +0200
Message-ID: <17492.29148.246942.842300@parhasard.net>
References: <17491.34779.959316.484740@parhasard.net>
To: rms@gnu.org
Cc: emacs-devel@gnu.org

On the twenty-ninth day of April, Richard Stallman wrote:

> [Comments on the text taken into account in the revised patch below.]
>
> [...]
>
> What is the reason for needing both \u and \U, and the difference? Why
> not use a syntax like that of \x?

They are both fixed-length expressions, which is good, because people get
into the habit of typing "\u0123As I walked out one evening" instead of the
more disastrous "\u123As I walked out one evening", where the "A" is read as
a hex digit. We could provide the same functionality with just the
\U00ABCDEF syntax, but since code points above #xFFFF are very rarely used,
having to type the four leading zeroes would be annoying most of the time.

The reason for not using variable-length constants, as \x does, is exactly
the "\u0123As I" versus "\u123As I walked out" issue above.

lispref/ChangeLog addition:

2006-04-30  Aidan Kehoe

        * objects.texi (Character Type):
        Describe the Unicode character escape syntax; \uABCD or \U00ABCDEF
        specifies Unicode characters U+ABCD and U+ABCDEF respectively.

src/ChangeLog addition:

2006-04-30  Aidan Kehoe

        * lread.c (read_escape):
        Provide a Unicode character escape syntax; \u followed by exactly
        four or \U followed by exactly eight hex digits in a character or
        string is read as a Unicode character with that code point.

GNU Emacs Trunk source patch:
Diff command:   cvs -q diff -u
Files affected: src/lread.c lispref/objects.texi

Index: lispref/objects.texi
===================================================================
RCS file: /sources/emacs/emacs/lispref/objects.texi,v
retrieving revision 1.51
diff -u -u -r1.51 objects.texi
--- lispref/objects.texi        6 Feb 2006 11:55:10 -0000       1.51
+++ lispref/objects.texi        30 Apr 2006 08:08:05 -0000
@@ -431,6 +431,20 @@
 bit values are 2**22 for alt, 2**23 for super and 2**24 for hyper.
 @end ifnottex
 
+@cindex unicode character escape
+  Emacs provides a syntax for specifying characters by their Unicode code
+points. @code{?\uABCD} represents a character that maps to the code
+point @samp{U+ABCD} in Unicode-based representations (UTF-8 text files,
+Unicode-oriented fonts, etc.). There is a slightly different syntax for
+specifying characters with code points above @code{#xFFFF};
+@code{\U00ABCDEF} represents an Emacs character that maps to the code
+point @samp{U+ABCDEF} in Unicode-based representations, if such an Emacs
+character exists.
+
+  Unlike in some other languages, while this syntax is available for
+character literals, and (see later) in strings, it is not available
+elsewhere in your Lisp source code.
+
 @cindex @samp{\} in character constant
 @cindex backslash in character constant
 @cindex octal character code
Index: src/lread.c
===================================================================
RCS file: /sources/emacs/emacs/src/lread.c,v
retrieving revision 1.350
diff -u -u -r1.350 lread.c
--- src/lread.c 27 Feb 2006 02:04:35 -0000      1.350
+++ src/lread.c 30 Apr 2006 08:08:07 -0000
@@ -1743,6 +1743,9 @@
      int *byterep;
 {
   register int c = READCHAR;
+  /* \u allows up to four hex digits, \U up to eight. Default to the
+     behaviour for \u, and change this value in the case that \U is seen. */
+  int unicode_hex_count = 4;
 
   *byterep = 0;
 
@@ -1907,6 +1910,48 @@
         return i;
       }
 
+    case 'U':
+      /* Post-Unicode-2.0: Up to eight hex chars */
+      unicode_hex_count = 8;
+    case 'u':
+
+      /* A Unicode escape. We only permit them in strings and characters,
+         not arbitrarily in the source code as in some other languages. */
+      {
+        int i = 0;
+        int count = 0;
+        Lisp_Object lisp_char;
+        while (++count <= unicode_hex_count)
+          {
+            c = READCHAR;
+            /* isdigit(), isalpha() may be locale-specific, which we don't
+               want. */
+            if (c >= '0' && c <= '9') i = (i << 4) + (c - '0');
+            else if (c >= 'a' && c <= 'f') i = (i << 4) + (c - 'a') + 10;
+            else if (c >= 'A' && c <= 'F') i = (i << 4) + (c - 'A') + 10;
+            else
+              {
+                error ("Non-hex digit used for Unicode escape");
+                break;
+              }
+          }
+
+        lisp_char = call2(intern("decode-char"), intern("ucs"),
+                          make_number(i));
+
+        if (EQ(Qnil, lisp_char))
+          {
+            /* This is ugly and horrible and trashes the user's data. */
+            XSETFASTINT (i, MAKE_CHAR (charset_katakana_jisx0201,
+                                       34 + 128, 46 + 128));
+            return i;
+          }
+        else
+          {
+            return XFASTINT (lisp_char);
+          }
+      }
+
     default:
       if (BASE_LEADING_CODE_P (c))
         c = read_multibyte (c, readcharfun);

-- 
In the beginning God created the heavens and the earth. And God was a
bug-eyed, hexagonal smurf with a head of electrified hair; and God said:
“Si, mi chiamano Mimi...”
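
For illustration only, and not part of the patch: a standalone C sketch of the fixed-width versus variable-width argument made in the message. The names hex_value, parse_fixed_hex and parse_variable_hex are hypothetical and do not exist in lread.c. The sketch only mimics the digit-consuming loop on a plain C string, assuming ASCII input, and does none of the character decoding that read_escape goes on to do.

```c
/* Value of one ASCII hex digit, or -1 if C is not a hex digit.
   Written out by hand, like the patch, to stay locale-independent.  */
int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Fixed-width form, as with \u (WIDTH 4) and \U (WIDTH 8): consume
   exactly WIDTH hex digits from S.  Returns the code point, or -1 if a
   non-hex character turns up first.  *CONSUMED receives the number of
   characters read before stopping.  */
long parse_fixed_hex(const char *s, int width, int *consumed)
{
    long value = 0;
    int n;

    for (n = 0; n < width; n++) {
        int d = hex_value((unsigned char) s[n]);
        if (d < 0) {
            *consumed = n;
            return -1;
        }
        value = (value << 4) + d;
    }
    *consumed = width;
    return value;
}

/* Variable-width form, in the style of \x: keep consuming for as long
   as hex digits follow, however many that turns out to be.  */
long parse_variable_hex(const char *s, int *consumed)
{
    long value = 0;
    int n = 0;
    int d;

    while ((d = hex_value((unsigned char) s[n])) >= 0) {
        value = (value << 4) + d;
        n++;
    }
    *consumed = n;
    return value;
}
```

With the text after the backslash and letter being "0123As I walked out", parse_fixed_hex with a width of 4 stops after "0123" and yields 0x0123, leaving "As I walked out" as ordinary text. parse_variable_hex on the same input also swallows the "A" and yields 0x123A, so padding with a leading zero does not protect the following text. The fixed-width form still misreads "123As I walked out" as 0x123A, which is why the habit of always typing four digits matters.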