From: Oliver Scholz
Newsgroups: gmane.emacs.devel
Subject: Re: [PATCH] Unicode Lisp reader escapes
Date: Thu, 04 May 2006 10:23:46 +0200
References: <17491.34779.959316.484740@parhasard.net> <87odyfnqcj.fsf-monnier+emacs@gnu.org>

For what it's worth, I made a stab at implementing \u analogous to
\x---including a port of the core functionality of `decode-char' to C.

As for the current discussion: I regard both, e.g., \u3b1 and
(decode-char 'ucs #x3b1) as a way of saying "Give me that abstract
character---the Greek letter alpha---I don't care about your internal
encoding, *just use your defaults*, but give me that character." So,
effectively, the respective functions should deal with fragmentation
and the like. It would matter, for instance, if the fontset specifies
different glyphs for the same abstract character depending on the
charsets.

But I see Eli's point. Ideally, the conversion (to ISO 8859-X) wouldn't
take place when the string is read, but when it is displayed or
inserted into a buffer. Logically, because that's when the difference
between abstract character and internal representation should become
effective.
Practically, because if the user loads a library containing strings
with \u escapes (or `decode-char' expressions eval'ed at load-time) and
*then* customises the value of `utf-fragment-on-decoding', the change
won't affect those characters. However, I believe that this is a minor
obscurity rather than a bug; I don't believe that anybody would get
seriously bitten by it.


    Oliver

Here's the patch, only slightly tested:

Index: src/lread.c
===================================================================
RCS file: /cvsroot/emacs/emacs/src/lread.c,v
retrieving revision 1.350
diff -u -r1.350 lread.c
--- src/lread.c	27 Feb 2006 02:04:35 -0000	1.350
+++ src/lread.c	4 May 2006 08:00:53 -0000
@@ -1731,6 +1731,102 @@
   return str[0];
 }
 
+
+#define READ_HEX_ESCAPE(i, c)                    \
+  while (1)                                      \
+    {                                            \
+      c = READCHAR;                              \
+      if (c >= '0' && c <= '9')                  \
+        {                                        \
+          i *= 16;                               \
+          i += c - '0';                          \
+        }                                        \
+      else if ((c >= 'a' && c <= 'f')            \
+               || (c >= 'A' && c <= 'F'))        \
+        {                                        \
+          i *= 16;                               \
+          if (c >= 'a' && c <= 'f')              \
+            i += c - 'a' + 10;                   \
+          else                                   \
+            i += c - 'A' + 10;                   \
+        }                                        \
+      else                                       \
+        {                                        \
+          UNREAD (c);                            \
+          break;                                 \
+        }                                        \
+    }
+
+
+
+/* Return the internal character corresponding to an UCS code point.  */
+
+int
+ucs_to_internal (ucs)
+     int ucs;
+{
+  int c = 0;
+  Lisp_Object tmp_char;
+
+  if (! EQ (Qnil, SYMBOL_VALUE (intern ("utf-translate-cjk-mode"))))
+    /* cf. `utf-lookup-subst-table-for-decode' */
+    {
+      if (EQ (Qnil, SYMBOL_VALUE (intern ("utf-translate-cjk-lang-env"))))
+        call0 (intern ("utf-translate-cjk-load-tables"));
+      tmp_char = Fgethash (make_number (ucs),
+                           Fget (intern ("utf-subst-table-for-decode"),
+                                 intern ("translation-hash-table")),
+                           Qnil);
+      if (! EQ (Qnil, tmp_char))
+        {
+          CHECK_NUMBER (tmp_char);
+          c = XFASTINT (tmp_char);
+        }
+    }
+
+  if (c)
+    /* We found the character already in the translation hash table.
+       Do nothing.  */
+    ;
+  else if (ucs < 160)
+    c = ucs;
+  else if (ucs < 256)
+    c = MAKE_CHAR (charset_latin_iso8859_1, ucs, 0);
+  else if (ucs < 0x2500)
+    {
+      ucs -= 0x0100;
+      c = MAKE_CHAR (charset_mule_unicode_0100_24ff,
+                     ((ucs / 96) + 32),
+                     ((ucs % 96) + 32));
+    }
+  else if (ucs < 0x3400)
+    {
+      ucs -= 0x2500;
+      c = MAKE_CHAR (charset_mule_unicode_2500_33ff,
+                     ((ucs / 96) + 32),
+                     ((ucs % 96) + 32));
+    }
+  else if ((ucs >= 0xE000) && (ucs < 0x10000))
+    {
+      ucs -= 0xE000;
+      c = MAKE_CHAR (charset_mule_unicode_e000_ffff,
+                     ((ucs / 96) + 32),
+                     ((ucs % 96) + 32));
+    }
+
+  if (c)
+    {
+      Lisp_Object vect = Fget (intern ("utf-translation-table-for-decode"),
+                               intern ("translation-table"));
+      tmp_char = Faref (vect, make_number (c));
+      if (! EQ (Qnil, tmp_char))
+        return XFASTINT (tmp_char);
+      return c;
+    }
+  else error ("Invalid or unsupported UCS character: %x", ucs);
+}
+
+
 /* Read a \-escape sequence, assuming we already read the `\'.
    If the escape sequence forces unibyte, store 1 into *BYTEREP.
    If the escape sequence forces multibyte, store 2 into *BYTEREP.
@@ -1879,34 +1975,24 @@
       /* A hex escape, as in ANSI C.  */
       {
         int i = 0;
-        while (1)
-          {
-            c = READCHAR;
-            if (c >= '0' && c <= '9')
-              {
-                i *= 16;
-                i += c - '0';
-              }
-            else if ((c >= 'a' && c <= 'f')
-                     || (c >= 'A' && c <= 'F'))
-              {
-                i *= 16;
-                if (c >= 'a' && c <= 'f')
-                  i += c - 'a' + 10;
-                else
-                  i += c - 'A' + 10;
-              }
-            else
-              {
-                UNREAD (c);
-                break;
-              }
-          }
-
+        READ_HEX_ESCAPE (i, c);
         *byterep = 2;
         return i;
       }
 
+    case 'u':
+      /* A hexadecimal reference to an UCS character.  */
+      {
+        int i = 0;
+        Lisp_Object lisp_char;
+
+        READ_HEX_ESCAPE (i, c);
+        *byterep = 2;
+
+        return ucs_to_internal (i);
+
+      }
+
     default:
       if (BASE_LEADING_CODE_P (c))
         c = read_multibyte (c, readcharfun);

-- 
15 Floréal an 214 de la Révolution
Liberté, Egalité, Fraternité!
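
The range arithmetic in ucs_to_internal above can be checked without
building Emacs. What follows is only a minimal standalone sketch of that
arithmetic, not part of the patch: the names `split_ucs', `struct
ucs_split' and the RANGE_* constants are hypothetical and do not exist
in Emacs, and the sketch deliberately ignores the lookups in
`utf-subst-table-for-decode' and `utf-translation-table-for-decode'. It
assumes only what the patch itself shows: code points below 256 are
passed through as a single position code, and each mule-unicode charset
is addressed by two position codes in the range 32..127.

```c
/* Hypothetical names for illustration; none of these are Emacs
   identifiers.  Each constant stands for one branch of the patch.  */
enum ucs_range
{
  RANGE_ASCII_C1,     /* U+0000..U+009F: code point used directly  */
  RANGE_LATIN1,       /* U+00A0..U+00FF: latin-iso8859-1  */
  RANGE_0100_24FF,    /* mule-unicode-0100-24ff  */
  RANGE_2500_33FF,    /* mule-unicode-2500-33ff  */
  RANGE_E000_FFFF,    /* mule-unicode-e000-ffff  */
  RANGE_UNSUPPORTED   /* everything else, e.g. U+3400..U+DFFF  */
};

struct ucs_split
{
  enum ucs_range range;
  int c1;   /* first position code (or the code point itself below 256)  */
  int c2;   /* second position code; 0 when the range has only one  */
};

/* Split UCS into a range and position codes, mirroring the
   if/else chain of the patch but without any translation tables.  */
struct ucs_split
split_ucs (int ucs)
{
  struct ucs_split s = { RANGE_UNSUPPORTED, 0, 0 };
  int base;

  if (ucs < 0)
    return s;
  if (ucs < 160)
    {
      s.range = RANGE_ASCII_C1;
      s.c1 = ucs;
      return s;
    }
  if (ucs < 256)
    {
      s.range = RANGE_LATIN1;
      s.c1 = ucs;
      return s;
    }

  if (ucs < 0x2500)
    {
      s.range = RANGE_0100_24FF;
      base = 0x0100;
    }
  else if (ucs < 0x3400)
    {
      s.range = RANGE_2500_33FF;
      base = 0x2500;
    }
  else if (ucs >= 0xE000 && ucs < 0x10000)
    {
      s.range = RANGE_E000_FFFF;
      base = 0xE000;
    }
  else
    return s;

  /* Offset within the charset, laid out as rows of 96 cells each;
     32 is added so that both codes fall in 32..127.  */
  ucs -= base;
  s.c1 = ucs / 96 + 32;
  s.c2 = ucs % 96 + 32;
  return s;
}
```

For \u3b1, for example, the offset is 0x3b1 - 0x100 = 689, which is
7 * 96 + 17, so split_ucs (0x3b1) yields the 0100-24ff range with
position codes 39 and 49. A code point such as U+4E00 falls in none of
the ranges and comes back as RANGE_UNSUPPORTED, which corresponds to the
error branch of the patch when the CJK translation table supplies
nothing.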