From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Oliver Scholz Newsgroups: gmane.emacs.devel Subject: Re: [PATCH] Unicode Lisp reader escapes Date: Fri, 05 May 2006 19:23:36 +0200 Message-ID: References: <17491.34779.959316.484740@parhasard.net> <87odyfnqcj.fsf-monnier+emacs@gnu.org> <87psiuj8at.fsf-monnier+emacs@gnu.org> <87bquctz7t.fsf-monnier+emacs@gnu.org> NNTP-Posting-Host: main.gmane.org Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" X-Trace: sea.gmane.org 1146849928 15168 80.91.229.2 (5 May 2006 17:25:28 GMT) X-Complaints-To: usenet@sea.gmane.org NNTP-Posting-Date: Fri, 5 May 2006 17:25:28 +0000 (UTC) Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri May 05 19:25:20 2006 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([199.232.76.165]) by ciao.gmane.org with esmtp (Exim 4.43) id 1Fc43D-00089U-08 for ged-emacs-devel@m.gmane.org; Fri, 05 May 2006 19:25:11 +0200 Original-Received: from localhost ([127.0.0.1] helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1Fc43C-0006jP-De for ged-emacs-devel@m.gmane.org; Fri, 05 May 2006 13:25:10 -0400 Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43) id 1Fc42z-0006j3-II for emacs-devel@gnu.org; Fri, 05 May 2006 13:24:57 -0400 Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43) id 1Fc42x-0006iC-DW for emacs-devel@gnu.org; Fri, 05 May 2006 13:24:56 -0400 Original-Received: from [199.232.76.173] (helo=monty-python.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1Fc42x-0006hu-7e for emacs-devel@gnu.org; Fri, 05 May 2006 13:24:55 -0400 Original-Received: from [80.91.229.2] (helo=ciao.gmane.org) by monty-python.gnu.org with esmtps (TLS-1.0:RSA_AES_256_CBC_SHA:32) (Exim 4.52) id 1Fc439-0001qG-8Y for emacs-devel@gnu.org; Fri, 05 May 2006 13:25:07 -0400 Original-Received: from list by ciao.gmane.org with local (Exim 4.43) id 1Fc42n-00085e-QG for emacs-devel@gnu.org; Fri, 05 May 2006 19:24:45 +0200 Original-Received: from dslb-084-058-191-070.pools.arcor-ip.net ([84.58.191.70]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Fri, 05 May 2006 19:24:45 +0200 Original-Received: from alkibiades by dslb-084-058-191-070.pools.arcor-ip.net with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Fri, 05 May 2006 19:24:45 +0200 X-Injected-Via-Gmane: http://gmane.org/ Original-To: emacs-devel@gnu.org Original-Lines: 287 Original-X-Complaints-To: usenet@sea.gmane.org X-Gmane-NNTP-Posting-Host: dslb-084-058-191-070.pools.arcor-ip.net User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.0.50 (gnu/linux) X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.emacs.devel:53958 Archived-At: --=-=-= For what it's worth, I just tried the attached little stress test on an updated C port of `decode-char' in order to check whether it returns equivalent results. It does. (Well, except intentional differences like that `ucs_to_internal' throws an error where `decode-char' returns nil.) Basically the test runs through all positive integers up to MAX_CHAR and inserts an alist into a temp buffer with each car being the integer and each cdr being a character in the \u syntax (e.g. `?\u3b1'). It then reads that alist again and checks whether `decode-char' on its car is `eq' to its cdr. I tried it with and without `utf-translate-cjk-mode' and with and without `utf-fragment-on-decoding'. Since all tests succeed, ucs_to_internal and `decode-char' are functionally equivalent on all supported characters. The test: --=-=-= Content-Type: application/emacs-lisp Content-Disposition: attachment; filename=ucs-test.el (defconst my-max-char ; See MAX_CHAR in charset.h (ash #x1F 14)) (defvar my-ucs-print t) (defun my-ucs-escape-test () (interactive) (with-temp-buffer ;; Insert an alist of characters. (insert "(") (dotimes (n my-max-char) (let ((c (decode-char 'ucs n))) (cond ((and c (char-valid-p c)) ; `decode-char' can handle `n'. (insert (format "(%d . ?\\u%x)\n" n n))) ((char-valid-p n) ; `decode-char' can not handle `n'. (insert (format "(%d . nil)" n))) (t (insert (format "(%d . invalid)" n)))))) (insert ")") ;; Read this list and check it agains `decode-char'. (goto-char (point-min)) (let ((char-list (read (current-buffer)))) (message "Char list succesfully read.") (sit-for 0) (when my-ucs-print (switch-to-buffer (generate-new-buffer "*UCS test*"))) (dolist (c char-list) (unless (or (eq (cdr c) 'invalid) (eq (decode-char 'ucs (car c)) (cdr c))) (error "Differing results for `decode-char' on %d and \\u%04X" (car c) (cdr c))) (when my-ucs-print (cond ((eq (cdr c) 'invalid) (insert (format "U+%04X -- Invalid\n" (car c)))) ((not (cdr c)) (insert (format "U+%04X -- Unsupported\n" (car c)))) (t (insert (format "U+%04X -- %c\n" (car c) (cdr c))))))) (message "All characters checked successfully.")))) --=-=-= The updated patch: --=-=-= Content-Disposition: inline; filename=ucs-escapes.diff Index: src/lread.c =================================================================== RCS file: /cvsroot/emacs/emacs/src/lread.c,v retrieving revision 1.350 diff -u -r1.350 lread.c --- src/lread.c 27 Feb 2006 02:04:35 -0000 1.350 +++ src/lread.c 5 May 2006 17:09:37 -0000 @@ -87,6 +87,9 @@ Lisp_Object Qbackquote, Qcomma, Qcomma_at, Qcomma_dot, Qfunction; Lisp_Object Qinhibit_file_name_operation; Lisp_Object Qeval_buffer_list, Veval_buffer_list; +Lisp_Object Qutf_translate_cjk_mode, Qutf_translate_cjk_lang_env, Qutf_translate_cjk_load_tables; +Lisp_Object Qutf_subst_table_for_decode, Qtranslation_hash_table; +Lisp_Object Qutf_translation_table_for_decode, Qtranslation_table; extern Lisp_Object Qevent_symbol_element_mask; extern Lisp_Object Qfile_exists_p; @@ -1731,6 +1734,110 @@ return str[0]; } + +#define READ_HEX_ESCAPE(i, c) \ + while (1) \ + { \ + c = READCHAR; \ + if (c >= '0' && c <= '9') \ + { \ + i *= 16; \ + i += c - '0'; \ + } \ + else if ((c >= 'a' && c <= 'f') \ + || (c >= 'A' && c <= 'F')) \ + { \ + i *= 16; \ + if (c >= 'a' && c <= 'f') \ + i += c - 'a' + 10; \ + else \ + i += c - 'A' + 10; \ + } \ + else \ + { \ + UNREAD (c); \ + break; \ + } \ + } + + + +/* Return the internal character coresponding to an UCS code point.*/ + +int +ucs_to_internal (ucs) + int ucs; +{ + int c = 0; + Lisp_Object tmp_char; + + if (! EQ (Qnil, SYMBOL_VALUE (Qutf_translate_cjk_mode))) + /* cf. `utf-lookup-subst-table-for-decode' */ + { + Lisp_Object hash; + + if (EQ (Qnil, SYMBOL_VALUE (Qutf_translate_cjk_lang_env))) + call0 (Qutf_translate_cjk_load_tables); + + hash = Fget (Qutf_subst_table_for_decode, Qtranslation_hash_table); + + if (HASH_TABLE_P (hash)) + { + tmp_char = Fgethash (make_number (ucs), hash, Qnil); + if (! EQ (Qnil, tmp_char)) + { + CHECK_NUMBER (tmp_char); + c = XFASTINT (tmp_char); + } + } + } + + if (c) + /* We found the character already in the translation hash table. + Do nothing. */ + ; + else if (ucs < 160) + c = ucs; + else if (ucs < 256) + c = MAKE_CHAR (charset_latin_iso8859_1, ucs, 0); + else if (ucs < 0x2500) + { + ucs -= 0x0100; + c = MAKE_CHAR (charset_mule_unicode_0100_24ff, + ((ucs / 96) + 32), + ((ucs % 96) + 32)); + } + else if (ucs < 0x3400) + { + ucs -= 0x2500; + c = MAKE_CHAR (charset_mule_unicode_2500_33ff, + ((ucs / 96) + 32), + ((ucs % 96) + 32)); + } + else if ((ucs >= 0xE000) && (ucs < 0x10000)) + { + ucs -= 0xE000; + c = MAKE_CHAR (charset_mule_unicode_e000_ffff, + ((ucs / 96) + 32), + ((ucs % 96) + 32)); + } + + if (c || ucs == 0) /* U+0000 is also a valid character. */ + { + Lisp_Object vect = Fget (Qutf_translation_table_for_decode, + Qtranslation_table); + if (CHAR_TABLE_P (vect)) + { + tmp_char = Faref (vect, make_number (c)); + if (! EQ (Qnil, tmp_char)) + return XFASTINT (tmp_char); + } + return c; + } + else error ("Invalid or unsupported UCS character: %x", ucs); +} + + /* Read a \-escape sequence, assuming we already read the `\'. If the escape sequence forces unibyte, store 1 into *BYTEREP. If the escape sequence forces multibyte, store 2 into *BYTEREP. @@ -1879,34 +1986,23 @@ /* A hex escape, as in ANSI C. */ { int i = 0; - while (1) - { - c = READCHAR; - if (c >= '0' && c <= '9') - { - i *= 16; - i += c - '0'; - } - else if ((c >= 'a' && c <= 'f') - || (c >= 'A' && c <= 'F')) - { - i *= 16; - if (c >= 'a' && c <= 'f') - i += c - 'a' + 10; - else - i += c - 'A' + 10; - } - else - { - UNREAD (c); - break; - } - } - + READ_HEX_ESCAPE (i, c); *byterep = 2; return i; } + case 'u': + /* A hexadecimal reference to an UCS character. */ + { + int i = 0; + + READ_HEX_ESCAPE (i, c); + *byterep = 2; + + return ucs_to_internal (i); + + } + default: if (BASE_LEADING_CODE_P (c)) c = read_multibyte (c, readcharfun); @@ -4121,6 +4217,27 @@ Vloads_in_progress = Qnil; staticpro (&Vloads_in_progress); + + Qutf_translate_cjk_mode = intern ("utf-translate-cjk-mode"); + staticpro (&Qutf_translate_cjk_mode); + + Qutf_translate_cjk_lang_env = intern ("utf-translate-cjk-lang-env"); + staticpro (&Qutf_translate_cjk_lang_env); + + Qutf_translate_cjk_load_tables = intern ("utf-translate-cjk-load-tables"); + staticpro (&Qutf_translate_cjk_load_tables); + + Qutf_subst_table_for_decode = intern ("utf-subst-table-for-decode"); + staticpro (&Qutf_subst_table_for_decode); + + Qtranslation_hash_table = intern ("translation-hash-table"); + staticpro (&Qutf_subst_table_for_decode); + + Qutf_translation_table_for_decode = intern ("utf-translation-table-for-decode"); + staticpro (&Qutf_translation_table_for_decode); + + Qtranslation_table = intern ("translation-table"); + staticpro (&Qtranslation_table); } /* arch-tag: a0d02733-0f96-4844-a659-9fd53c4f414d --=-=-= Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Oliver -- 16 Floréal an 214 de la Révolution Liberté, Egalité, Fraternité! --=-=-= Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Emacs-devel mailing list Emacs-devel@gnu.org http://lists.gnu.org/mailman/listinfo/emacs-devel --=-=-=--