From: Aidan Kehoe
Newsgroups: gmane.emacs.devel
To: emacs-devel@gnu.org
Subject: [PATCH] Unicode Lisp reader escapes
Date: Sat, 29 Apr 2006 17:35:55 +0200
Message-ID: <17491.34779.959316.484740@parhasard.net>
Xref: news.gmane.org gmane.emacs.devel:53596

I realise you are all focused on the release with an intensity that would
scare small children, were any of them let near, but if any of you have a
minute free, I’d love to hear philosophical and technical objections to the
below.

The background is that it hasn’t ever been possible to consistently specify
a non-Latin-1 character by means of a general escape sequence, since what
character a given integer represents varies from release to release and even
from invocation to invocation. The below allows you to specify a backslash
escape with exactly four or exactly eight hexadecimal digits in a character
or string, and have the editor interpret them as the corresponding Unicode
code point. So, ?\u20AC would be interpreted as the Euro sign, "\u0448" as
Cyrillic sha, ?\U0001D0ED as Byzantine musical symbol arktiko ke.

Why not wait until the Unicode branch is merged? Well, that won’t solve the
problem either; people naturally want their code to be as compatible as
possible, so they will avoid the assumption that the integer-to-character
mapping is Unicode compatible as long as there are editors in the wild for
which that is not true. If this is integrated a good bit before the Unicode
branch is (which is what I would like), it will mean people can use this
syntax (which most modern programming languages have already, and which
people use) and be sure it’s compatible years before what would otherwise be
the case.

lispref/ChangeLog addition:

2006-04-29  Aidan Kehoe

        * objects.texi (Character Type): Describe the Unicode character
        escape syntax; \uABCD or \U00ABCDEF specifies Unicode characters
        U+ABCD and U+ABCDEF respectively.

src/ChangeLog addition:

2006-04-29  Aidan Kehoe

        * lread.c (read_escape): Provide a Unicode character escape
        syntax; \u followed by exactly four or \U followed by exactly
        eight hex digits in a character or string is read as a Unicode
        character with that code point.

GNU Emacs Trunk source patch:
Diff command:   cvs -q diff -u
Files affected: src/lread.c lispref/objects.texi

Index: lispref/objects.texi
===================================================================
RCS file: /sources/emacs/emacs/lispref/objects.texi,v
retrieving revision 1.51
diff -u -u -r1.51 objects.texi
--- lispref/objects.texi        6 Feb 2006 11:55:10 -0000       1.51
+++ lispref/objects.texi        29 Apr 2006 15:15:09 -0000
@@ -431,6 +431,20 @@
 bit values are 2**22 for alt, 2**23 for super and 2**24 for hyper.
 @end ifnottex
 
+@cindex unicode character escape
+  Emacs provides a syntax for specifying characters by their Unicode code
+points.  @samp{?\uABCD} will give you an Emacs character that maps to
+the code point @samp{U+ABCD} in Unicode-based representations (UTF-8
+text files, Unicode-oriented fonts, etc.)  There is a slightly different
+syntax for specifying characters with code points above @samp{#xFFFF};
+@samp{\U00ABCDEF} will give you an Emacs character that maps to the code
+point @samp{U+ABCDEF} in Unicode-based representations, if such an Emacs
+character exists.
+
+  Unlike in some other languages, while this syntax is available for
+character literals, and (see later) in strings, it is not available
+elsewhere in your Lisp source code.
+
 @cindex @samp{\} in character constant
 @cindex backslash in character constant
 @cindex octal character code
Index: src/lread.c
===================================================================
RCS file: /sources/emacs/emacs/src/lread.c,v
retrieving revision 1.350
diff -u -u -r1.350 lread.c
--- src/lread.c 27 Feb 2006 02:04:35 -0000      1.350
+++ src/lread.c 29 Apr 2006 15:15:10 -0000
@@ -1743,6 +1743,9 @@
      int *byterep;
 {
   register int c = READCHAR;
+  /* \u takes exactly four hex digits, \U exactly eight. Default to the
+     behaviour for \u, and change this value in the case that \U is seen. */
+  int unicode_hex_count = 4;
 
   *byterep = 0;
 
@@ -1907,6 +1910,48 @@
         return i;
       }
 
+    case 'U':
+      /* Post-Unicode-2.0: exactly eight hex chars */
+      unicode_hex_count = 8;
+    case 'u':
+
+      /* A Unicode escape. We only permit them in strings and characters,
+         not arbitrarily in the source code as in some other languages. */
+      {
+        int i = 0;
+        int count = 0;
+        Lisp_Object lisp_char;
+        while (++count <= unicode_hex_count)
+          {
+            c = READCHAR;
+            /* isdigit(), isalpha() may be locale-specific, which we don't
+               want. */
+            if      (c >= '0' && c <= '9')  i = (i << 4) + (c - '0');
+            else if (c >= 'a' && c <= 'f')  i = (i << 4) + (c - 'a') + 10;
+            else if (c >= 'A' && c <= 'F')  i = (i << 4) + (c - 'A') + 10;
+            else
+              {
+                error ("Non-hex digit used for Unicode escape");
+                break;
+              }
+          }
+
+        lisp_char = call2(intern("decode-char"), intern("ucs"),
+                          make_number(i));
+
+        if (EQ(Qnil, lisp_char))
+          {
+            /* This is ugly and horrible and trashes the user's data. */
+            XSETFASTINT (i, MAKE_CHAR (charset_katakana_jisx0201,
+                                       34 + 128, 46 + 128));
+            return i;
+          }
+        else
+          {
+            return XFASTINT (lisp_char);
+          }
+      }
+
     default:
       if (BASE_LEADING_CODE_P (c))
         c = read_multibyte (c, readcharfun);

-- 
In the beginning God created the heavens and the earth. And God was a
bug-eyed, hexagonal smurf with a head of electrified hair; and God said:
“Si, mi chiamano Mimi...”
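
P.S. For anyone who wants to poke at the digit handling outside Emacs, here is a minimal standalone sketch in C of the same idea: \u followed by exactly four, or \U followed by exactly eight, hex digits, decoded without the locale-dependent isdigit() and isalpha(). It is not the patch code. The names parse_unicode_escape and hex_digit_value are hypothetical, it reads from a NUL-terminated string rather than through READCHAR, and it returns -1 for a malformed escape where the patch calls error(). It also stops at the raw code point and does not attempt the decode-char step.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if C is not one.  Explicit range
   comparisons are used because isdigit() and isalpha() may be
   locale-specific.  */
static int
hex_digit_value (int c)
{
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

/* Hypothetical helper, not part of lread.c.  S points just past the
   backslash of an escape such as "u20AC" or "U0001D0ED".  Returns the
   code point, or -1 if the escape letter is neither u nor U or if fewer
   than the required number of hex digits follow it.  */
long long
parse_unicode_escape (const char *s)
{
  int want, n;
  long long code = 0;

  assert (s != NULL);

  if (s[0] == 'u')
    want = 4;                   /* \u: exactly four hex digits */
  else if (s[0] == 'U')
    want = 8;                   /* \U: exactly eight hex digits */
  else
    return -1;

  for (n = 1; n <= want; n++)
    {
      int v = hex_digit_value ((unsigned char) s[n]);
      if (v < 0)
        return -1;              /* non-hex digit, or the string ended early */
      code = (code << 4) + v;
    }

  return code;
}
```

Under these assumptions, "u20AC" yields 0x20AC and "U0001D0ED" yields 0x1D0ED, while a seven-digit "U001D0ED" is rejected because the eighth digit is missing.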