From mboxrd@z Thu Jan 1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Aidan Kehoe
Newsgroups: gmane.emacs.devel
Subject: [PATCH] Unicode Lisp reader escapes.
Date: Sat, 3 Jun 2006 20:44:22 +0200
Message-ID: <17537.55430.367248.888471@parhasard.net>
References: <17491.34779.959316.484740@parhasard.net>
  <17492.29148.246942.842300@parhasard.net>
  <8764kkawsf.fsf@jurta.org> <87vesi6nh1.fsf@jurta.org>
  <878xp8g2a9.fsf@jurta.org>
NNTP-Posting-Host: main.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Trace: sea.gmane.org 1149360120 32716 80.91.229.2 (3 Jun 2006 18:42:00 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Sat, 3 Jun 2006 18:42:00 +0000 (UTC)
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Sat Jun 03 20:41:57 2006
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
  by ciao.gmane.org with esmtp (Exim 4.43) id 1Fmb4I-0003zz-FI
  for ged-emacs-devel@m.gmane.org; Sat, 03 Jun 2006 20:41:50 +0200
Original-Received: from localhost ([127.0.0.1] helo=lists.gnu.org)
  by lists.gnu.org with esmtp (Exim 4.43) id 1Fmb4H-0002tz-T8
  for ged-emacs-devel@m.gmane.org; Sat, 03 Jun 2006 14:41:49 -0400
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
  id 1Fmb45-0002sR-PM
  for emacs-devel@gnu.org; Sat, 03 Jun 2006 14:41:37 -0400
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
  id 1Fmb44-0002rw-2x
  for emacs-devel@gnu.org; Sat, 03 Jun 2006 14:41:37 -0400
Original-Received: from [199.232.76.173] (helo=monty-python.gnu.org)
  by lists.gnu.org with esmtp (Exim 4.43) id 1Fmb43-0002rl-KK
  for emacs-devel@gnu.org; Sat, 03 Jun 2006 14:41:35 -0400
Original-Received: from [66.111.49.30] (helo=icarus.asclepian.ie)
  by monty-python.gnu.org with esmtp (Exim 4.52) id 1FmbAo-0001cq-0F
  for emacs-devel@gnu.org; Sat, 03 Jun 2006 14:48:34 -0400
Original-Received: by icarus.asclepian.ie (Postfix, from userid 1003)
  id E28098008C; Sat, 3 Jun 2006 19:41:34 +0100 (IST)
Original-To: emacs-devel@gnu.org
X-Mailer: VM 7.17 under 21.5 (beta26) "endive" (+CVS-20060512) XEmacs Lucid
X-NS5-file-as-sent: t
X-Echelon-distraction: Project Monarch terrorist plutonium Rule Psix Kibo SCUD missile
X-Generated-By: Patcher version 3.8
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Emacs development discussions."
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:55673

Jonas Jacobson just sent me confirmation that my once again signed
assignments have been received, together with PDF copies of same. Given
that, here is my final version of the patch I proposed in my first mail;
differences from that version are an entry in the NEWS file, some prose
style changes in the manual, and a GCPRO to protect readcharfun in lread.c.

etc/ChangeLog addition:

2006-06-03  Aidan Kehoe

        * NEWS: Describe the new syntax for specifying characters with
        Unicode escapes.

lispref/ChangeLog addition:

2006-06-03  Aidan Kehoe

        * objects.texi (Character Type): Describe the Unicode character
        escape syntax; \uABCD or \U00ABCDEF specifies Unicode characters
        U+ABCD and U+ABCDEF respectively.

src/ChangeLog addition:

2006-06-03  Aidan Kehoe

        * lread.c (read_escape): Provide a Unicode character escape
        syntax; \u followed by exactly four or \U followed by exactly
        eight hex digits in a character or string is read as a Unicode
        character with that code point.
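
As a rough illustration of the reader behaviour described in the
src/ChangeLog entry, here is a standalone C sketch of reading a fixed
number of hex digits into a code point. The name parse_unicode_escape,
the string argument and the -1 error return are assumptions made for
this example only; the patch itself reads characters with READCHAR,
signals a Lisp error on a bad digit, and hands the result to decode-char.

```c
/* Hypothetical helper, not part of the patch: read exactly NDIGITS
   hexadecimal digits from S and return their value, or -1 if a
   non-hex character (including the terminating NUL) turns up first.
   Pass 4 for the \u form and 8 for the \U form.  */
long long
parse_unicode_escape (const char *s, int ndigits)
{
  long long value = 0;
  int count;

  for (count = 0; count < ndigits; count++)
    {
      int c = (unsigned char) s[count];

      /* Explicit ranges rather than isxdigit(), which may be
         locale-specific.  */
      if (c >= '0' && c <= '9')
        value = (value << 4) + (c - '0');
      else if (c >= 'a' && c <= 'f')
        value = (value << 4) + (c - 'a') + 10;
      else if (c >= 'A' && c <= 'F')
        value = (value << 4) + (c - 'A') + 10;
      else
        return -1;
    }

  return value;
}
```

Called as parse_unicode_escape ("0428", 4), the sketch accumulates the
digits four bits at a time, in the same way as the loop in the patch;
it makes no attempt to check that the result is a valid code point.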

GNU Emacs Trunk source patch:
Diff command:   cvs -q diff -u
Files affected: src/lread.c lispref/objects.texi etc/NEWS

Index: etc/NEWS
===================================================================
RCS file: /sources/emacs/emacs/etc/NEWS,v
retrieving revision 1.1337
diff -u -u -r1.1337 NEWS
--- etc/NEWS 2 May 2006 01:47:57 -0000 1.1337
+++ etc/NEWS 3 Jun 2006 18:16:51 -0000
@@ -3772,6 +3772,13 @@
 been declared obsolete.
 
 +++
+*** New syntax: \uXXXX and \UXXXXXXXX specify Unicode code points in hex.
+Use "\u0428" to specify a string consisting of CYRILLIC CAPITAL LETTER SHA,
+or "\U0001D6E2" to specify one consisting of MATHEMATICAL ITALIC CAPITAL
+ALPHA (the latter is greater than #xFFFF and thus needs the longer
+syntax). Also available for characters.
+
++++
 ** Displaying warnings to the user.
 
 See the functions `warn' and `display-warning', or the Lisp Manual.
Index: lispref/objects.texi
===================================================================
RCS file: /sources/emacs/emacs/lispref/objects.texi,v
retrieving revision 1.53
diff -u -u -r1.53 objects.texi
--- lispref/objects.texi 1 May 2006 15:05:48 -0000 1.53
+++ lispref/objects.texi 3 Jun 2006 18:16:52 -0000
@@ -431,6 +431,20 @@
 bit values are 2**22 for alt, 2**23 for super and 2**24 for hyper.
 @end ifnottex
 
+@cindex unicode character escape
+  Emacs provides a syntax for specifying characters by their Unicode code
+points. @code{?\uABCD} represents a character that maps to the code
+point @samp{U+ABCD} in Unicode-based representations (UTF-8 text files,
+Unicode-oriented fonts, etc.). There is a slightly different syntax for
+specifying characters with code points above @code{#xFFFF};
+@code{\U00ABCDEF} represents an Emacs character that maps to the code
+point @samp{U+ABCDEF} in Unicode-based representations, if such an Emacs
+character exists.
+
+  Unlike in some other languages, while this syntax is available for
+character literals, and (see later) in strings, it is not available
+elsewhere in your Lisp source code.
+
 @cindex @samp{\} in character constant
 @cindex backslash in character constant
 @cindex octal character code
Index: src/lread.c
===================================================================
RCS file: /sources/emacs/emacs/src/lread.c,v
retrieving revision 1.350
diff -u -u -r1.350 lread.c
--- src/lread.c 27 Feb 2006 02:04:35 -0000 1.350
+++ src/lread.c 3 Jun 2006 18:16:54 -0000
@@ -1743,6 +1743,9 @@
      int *byterep;
 {
   register int c = READCHAR;
+  /* \u takes exactly four hex digits, \U exactly eight. Default to the
+     behaviour for \u, and change this value in the case that \U is seen. */
+  int unicode_hex_count = 4;
 
   *byterep = 0;
 
@@ -1907,6 +1910,52 @@
         return i;
       }
 
+    case 'U':
+      /* Post-Unicode-2.0: eight hex digits; falls through to 'u'. */
+      unicode_hex_count = 8;
+    case 'u':
+
+      /* A Unicode escape. We only permit them in strings and characters,
+         not arbitrarily in the source code as in some other languages. */
+      {
+        int i = 0;
+        int count = 0;
+        Lisp_Object lisp_char;
+        struct gcpro gcpro1;
+
+        while (++count <= unicode_hex_count)
+          {
+            c = READCHAR;
+            /* isdigit(), isalpha() may be locale-specific, which we don't
+               want. */
+            if (c >= '0' && c <= '9') i = (i << 4) + (c - '0');
+            else if (c >= 'a' && c <= 'f') i = (i << 4) + (c - 'a') + 10;
+            else if (c >= 'A' && c <= 'F') i = (i << 4) + (c - 'A') + 10;
+            else
+              {
+                error ("Non-hex digit used for Unicode escape");
+                break;
+              }
+          }
+
+        GCPRO1 (readcharfun);
+        lisp_char = call2(intern("decode-char"), intern("ucs"),
+                          make_number(i));
+        UNGCPRO;
+
+        if (EQ(Qnil, lisp_char))
+          {
+            /* This is ugly and horrible and trashes the user's data. */
+            XSETFASTINT (i, MAKE_CHAR (charset_katakana_jisx0201,
+                                       34 + 128, 46 + 128));
+            return i;
+          }
+        else
+          {
+            return XFASTINT (lisp_char);
+          }
+      }
+
     default:
       if (BASE_LEADING_CODE_P (c))
         c = read_multibyte (c, readcharfun);

-- 
Aidan Kehoe, http://www.parhasard.net/