* Character literals for Unicode (control) characters
From: Lars Ingebrigtsen @ 2016-03-03 5:47 UTC
To: emacs-devel

I was implementing support for the <bdo> HTML tag the other day.  (It's
for overriding bidi directionality in text.)  This is what I ended up
with:

(defun shr-tag-bdo (dom)
  (let* ((direction (dom-attr dom 'dir))
         (char (cond
                ((equal direction "ltr")
                 #x202d)                ; LRO
                ((equal direction "rtl")
                 #x202e))))             ; RLO
    (when char
      (insert char))
    (shr-generic dom)
    (when char
      (insert #x202c))))                ; PDF

And it just struck me that it would be kinda nice if Emacs had a
literal character syntax for these things.  I mean, we have such a
syntax for some "problematic" ASCII characters already: We recommend
writing ?\s instead of ? , and we recommend writing ?\n instead of ?
, because that's just very confusing.

And then I thought -- well, if we should have a literal syntax for
Unicode control characters, why not for all of them?  We do have the
mapping already in Emacs, so it wouldn't be very difficult to
implement...

So.  Three options:

1) Add a new syntax, perhaps something like ?\ucRIGHT-TO-LEFT-OVERRIDE
for the Unicode control characters we care about.

2) Add a syntax for all Unicode characters, like ?\ucPILE-OF-POO.  We
can just write ?💩, so this isn't totally necessary, but perhaps it's
nice?

c) Do nothing, and continue writing code like the code above.  Or start
using the Unicode control characters directly in the code, but there
lies madness.  (Note Unicode control characters around the last part of
the previous sentence.)

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no
* Re: Character literals for Unicode (control) characters 2016-03-03 5:47 Character literals for Unicode (control) characters Lars Ingebrigtsen @ 2016-03-03 6:20 ` John Wiegley 2016-03-03 6:25 ` Lars Ingebrigtsen 2016-03-03 6:34 ` Drew Adams 2016-03-03 16:11 ` Paul Eggert 2 siblings, 1 reply; 47+ messages in thread From: John Wiegley @ 2016-03-03 6:20 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: emacs-devel >>>>> Lars Ingebrigtsen <larsi@gnus.org> writes: > 1) Add a new syntax, perhaps something like ?\ucRIGHT-TO-LEFT-OVERRIDE for > the Unicode control characters we care about. Would it just have to be for control characters we care about? Whatever list C-x 8 RET is drawing from, could it just lookup there? -- John Wiegley GPG fingerprint = 4710 CF98 AF9B 327B B80F http://newartisans.com 60E1 46C4 BD1A 7AC1 4BA2 ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Character literals for Unicode (control) characters 2016-03-03 6:20 ` John Wiegley @ 2016-03-03 6:25 ` Lars Ingebrigtsen 0 siblings, 0 replies; 47+ messages in thread From: Lars Ingebrigtsen @ 2016-03-03 6:25 UTC (permalink / raw) To: emacs-devel John Wiegley <jwiegley@gmail.com> writes: >>>>>> Lars Ingebrigtsen <larsi@gnus.org> writes: > >> 1) Add a new syntax, perhaps something like ?\ucRIGHT-TO-LEFT-OVERRIDE for >> the Unicode control characters we care about. > > Would it just have to be for control characters we care about? Whatever list > C-x 8 RET is drawing from, could it just lookup there? Yes, that was 2). :-) -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 47+ messages in thread
* RE: Character literals for Unicode (control) characters
From: Drew Adams @ 2016-03-03 6:34 UTC
To: Lars Ingebrigtsen, emacs-devel

> And it just struck me that it would be kinda nice if Emacs had a
> literal character syntax for these things...  So.  Three options:
>
> 1) Add a new syntax, perhaps something like
>    ?\ucRIGHT-TO-LEFT-OVERRIDE for the Unicode control characters we
>    care about.
>
> 2) Add a syntax for all Unicode characters, like ?\ucPILE-OF-POO.
>    We can just write ?💩, so this isn't totally necessary, but
>    perhaps it's nice?
>
> c) Do nothing, and continue writing code like the code above.  Or
>    start using the Unicode control characters directly in the code,
>    but there lies madness.  (Note Unicode control characters around
>    the last part of the previous sentence.)

4) (or is it d? ;-))  Continue to write code like that above.  When it
helps, add a comment showing or describing the char.

Or be able to hit a key to describe the char whose code is at point:

(defun describe-char-code-at-point ()
  "Describe character whose code is at point."
  (interactive)
  (let* ((sexp  (thing-at-point 'sexp))
         (chr   (and sexp  (read sexp))))
    (unless (characterp chr) (error "No character code at point"))
    (with-temp-buffer
      (insert chr)
      (describe-char (1- (point))))))

Use it with point anywhere on a char code, e.g., #x202d, to see its
description, including, e.g.:

 name: LEFT-TO-RIGHT OVERRIDE
 general-category: Cf (Other, Format)
 decomposition: (8237) ('')
* Re: Character literals for Unicode (control) characters 2016-03-03 5:47 Character literals for Unicode (control) characters Lars Ingebrigtsen 2016-03-03 6:20 ` John Wiegley 2016-03-03 6:34 ` Drew Adams @ 2016-03-03 16:11 ` Paul Eggert 2016-03-03 20:48 ` Eli Zaretskii ` (2 more replies) 2 siblings, 3 replies; 47+ messages in thread From: Paul Eggert @ 2016-03-03 16:11 UTC (permalink / raw) To: Lars Ingebrigtsen, emacs-devel On 03/02/2016 09:47 PM, Lars Ingebrigtsen wrote: > And then I thought -- well, if we should have a literal syntax for > Unicode control characters, why not for all of them? Something like that would make sense. The escape sequence should bracket the name, so that the escape sequences could be used in strings without ambiguity. Something like \u[NAME], say. I'd still prefer to use characters as-is in strings if they're displayable, e.g., the Lisp string: "Use Greek capital letters (Α–Ω) to denote figures." is more readable than: "Use Greek capital letters (\u[GREEK CAPITAL LETTER ALPHA]\u[EN DASH]\u[GREEK CAPITAL LETTER OMEGA]) to denote figures." But for undisplayable or hard-to-read characters the escape sequence would be a win. More issues: should we insist on the full official name? should we allow obsolescent aliases? lower-case instead of upper case? initial prefixes of names? ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Character literals for Unicode (control) characters 2016-03-03 16:11 ` Paul Eggert @ 2016-03-03 20:48 ` Eli Zaretskii 2016-03-03 23:58 ` Paul Eggert 2016-03-05 15:28 ` Philipp Stephani 2016-03-05 16:35 ` Clément Pit--Claudel 2 siblings, 1 reply; 47+ messages in thread From: Eli Zaretskii @ 2016-03-03 20:48 UTC (permalink / raw) To: Paul Eggert; +Cc: larsi, emacs-devel > From: Paul Eggert <eggert@cs.ucla.edu> > Date: Thu, 3 Mar 2016 08:11:43 -0800 > > On 03/02/2016 09:47 PM, Lars Ingebrigtsen wrote: > > And then I thought -- well, if we should have a literal syntax for > > Unicode control characters, why not for all of them? > Something like that would make sense. The escape sequence should bracket > the name, so that the escape sequences could be used in strings without > ambiguity. Something like \u[NAME], say. Unicode's UTS#18 (http://unicode.org/reports/tr18/) proposes \N[NAME] instead, albeit in the context of regular expressions. Perhaps we should adopt that instead. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Character literals for Unicode (control) characters 2016-03-03 20:48 ` Eli Zaretskii @ 2016-03-03 23:58 ` Paul Eggert 0 siblings, 0 replies; 47+ messages in thread From: Paul Eggert @ 2016-03-03 23:58 UTC (permalink / raw) To: Eli Zaretskii; +Cc: larsi, emacs-devel On 03/03/2016 12:48 PM, Eli Zaretskii wrote: > Unicode's UTS#18 (http://unicode.org/reports/tr18/) proposes \N[NAME] > instead, albeit in the context of regular expressions. Perhaps we > should adopt that instead. Sure, that works. Except as I read it, they're proposing curly braces, e.g.: "Use Greek capital letters (\N{GREEK CAPITAL LETTER ALPHA}\N{EN DASH}\N{GREEK CAPITAL LETTER OMEGA}) to denote figures." Since Elisp caters to strings that cross line boundaries, it looks like lread.c's read_escape should allow arbitrary nonempty white space in places where the official Unicode name contains a single space. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Character literals for Unicode (control) characters
From: Philipp Stephani @ 2016-03-05 15:28 UTC
To: Paul Eggert, Lars Ingebrigtsen, emacs-devel

Paul Eggert <eggert@cs.ucla.edu> wrote on Thu, 3 March 2016 at 17:11:

> On 03/02/2016 09:47 PM, Lars Ingebrigtsen wrote:
> > And then I thought -- well, if we should have a literal syntax for
> > Unicode control characters, why not for all of them?
> Something like that would make sense. The escape sequence should bracket
> the name, so that the escape sequences could be used in strings without
> ambiguity. Something like \u[NAME], say.
>
> I'd still prefer to use characters as-is in strings if they're
> displayable, e.g., the Lisp string:
>
> "Use Greek capital letters (Α–Ω) to denote figures."
>
> is more readable than:
>
> "Use Greek capital letters (\u[GREEK CAPITAL LETTER ALPHA]\u[EN
> DASH]\u[GREEK CAPITAL LETTER OMEGA]) to denote figures."
>
> But for undisplayable or hard-to-read characters the escape sequence
> would be a win.
>
> More issues: should we insist on the full official name? should we allow
> obsolescent aliases? lower-case instead of upper case? initial prefixes
> of names?

We should probably do whatever Perl does
(http://perldoc.perl.org/charnames.html). I haven't checked in detail
what is allowed by Perl (except that it allows \N{name} and \N{U+code}),
but it would be simpler to just adopt Perl's behavior (to a reasonable
extent) than trying to come up with our own syntax.
* Re: Character literals for Unicode (control) characters 2016-03-05 15:28 ` Philipp Stephani @ 2016-03-05 15:39 ` Marcin Borkowski 2016-03-05 16:51 ` Philipp Stephani 2016-03-06 2:27 ` John Wiegley 1 sibling, 1 reply; 47+ messages in thread From: Marcin Borkowski @ 2016-03-05 15:39 UTC (permalink / raw) To: Philipp Stephani; +Cc: Lars Ingebrigtsen, Paul Eggert, emacs-devel On 2016-03-05, at 16:28, Philipp Stephani <p.stephani2@gmail.com> wrote: > We should probably do whatever Perl does This is a slippery slope. (SCNR;-).) Best, -- Marcin Borkowski http://octd.wmi.amu.edu.pl/en/Marcin_Borkowski Faculty of Mathematics and Computer Science Adam Mickiewicz University ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Character literals for Unicode (control) characters
From: Philipp Stephani @ 2016-03-05 16:51 UTC
To: Marcin Borkowski; +Cc: Lars Ingebrigtsen, Paul Eggert, emacs-devel

Marcin Borkowski <mbork@mbork.pl> wrote on Sat, 5 March 2016 at 16:40:

> On 2016-03-05, at 16:28, Philipp Stephani <p.stephani2@gmail.com> wrote:
>
> > We should probably do whatever Perl does
>
> This is a slippery slope.
>
> (SCNR;-).)
>
> Best,

Yes, and I certainly wouldn't introduce e.g. custom aliases (given the
lack of reader macros such a change would be very unconventional). I'd
vote for implementing "full" support and some of the aspects of "loose"
support, as suggested by Paul.
* Re: Character literals for Unicode (control) characters
From: John Wiegley @ 2016-03-06 2:27 UTC
To: Philipp Stephani; +Cc: Lars Ingebrigtsen, Paul Eggert, emacs-devel

>>>>> Philipp Stephani <p.stephani2@gmail.com> writes:

> We should probably do whatever Perl does
> (http://perldoc.perl.org/charnames.html). I haven't checked in detail what
> is allowed by Perl (except that it allows \N{name} and \N{U+code}), but it
> would be simpler to just adopt Perl's behavior (to a reasonable extent) than
> trying to come up with our own syntax.

This is a pretty reasonable request, to avoid having to remember multiple
syntaxes as much as possible. My life is already shorter from having to
correct misuses of \(\|\).

-- 
John Wiegley                  GPG fingerprint = 4710 CF98 AF9B 327B B80F
http://newartisans.com                          60E1 46C4 BD1A 7AC1 4BA2
* Re: Character literals for Unicode (control) characters 2016-03-06 2:27 ` John Wiegley @ 2016-03-06 15:24 ` Philipp Stephani 2016-03-06 15:54 ` Eli Zaretskii 0 siblings, 1 reply; 47+ messages in thread From: Philipp Stephani @ 2016-03-06 15:24 UTC (permalink / raw) To: John Wiegley, Paul Eggert, Lars Ingebrigtsen, emacs-devel [-- Attachment #1.1: Type: text/plain, Size: 729 bytes --] John Wiegley <jwiegley@gmail.com> schrieb am So., 6. März 2016 um 03:58 Uhr: > >>>>> Philipp Stephani <p.stephani2@gmail.com> writes: > > > We should probably do whatever Perl does > > (http://perldoc.perl.org/charnames.html). I haven't checked in detail > what > > is allowed by Perl (except that it allows \N{name} and \N{U+code}), but > it > > would be simpler to just adopt Perl's behavior (to a reasonable extend) > than > > trying to come up with our own syntax. > > This is a pretty reasonable request, to avoid having to remember multiple > syntaxes as much as possible. My life is already shorter from having to > correct misuses of \(\|\). > > I've attached a patch with an initial implementation. [-- Attachment #1.2: Type: text/html, Size: 1206 bytes --] [-- Attachment #2: 0001-Implement-named-character-escapes-similar-to-Perl.patch --] [-- Type: application/octet-stream, Size: 7114 bytes --] From 46540682975d85eecfffa3d553922abdedcdd9c1 Mon Sep 17 00:00:00 2001 From: Philipp Stephani <phst@google.com> Date: Sun, 6 Mar 2016 16:16:29 +0100 Subject: [PATCH] Implement named character escapes, similar to Perl * lread.c (init_character_names): New function. (read_escape): Read Perl-style named character escape sequences. (syms_of_lread): Initialize new variable `character_names'. * test/src/lread-tests.el (lread-char-empty-name): Add test file for src/lread.c. 
--- src/lread.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++ test/src/lread-tests.el | 54 ++++++++++++++++++++++++++++ 2 files changed, 150 insertions(+) create mode 100644 test/src/lread-tests.el diff --git a/src/lread.c b/src/lread.c index 25e3ff0..693de32 100644 --- a/src/lread.c +++ b/src/lread.c @@ -43,6 +43,7 @@ along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>. */ #include "systime.h" #include "termhooks.h" #include "blockinput.h" +#include <c-ctype.h> #ifdef MSDOS #include "msdos.h" @@ -2150,6 +2151,36 @@ grow_read_buffer (void) MAX_MULTIBYTE_LENGTH, -1, 1); } +/* Hash table that maps Unicode character names to code points. */ +static Lisp_Object character_names; + +/* Length of the longest Unicode character name, in bytes. */ +static ptrdiff_t max_character_name_length; + +/* Initializes `character_names' and `max_character_name_length'. + Called by `read_escape'. */ +void init_character_names () +{ + character_names = CALLN (Fmake_hash_table, + QCtest, Qequal, + /* Currently around 100,000 Unicode + characters are defined. */ + QCsize, make_natnum (100000)); + const Lisp_Object get_property = + Fsymbol_function (intern_c_string ("get-char-code-property")); + ptrdiff_t length = 0; + for (int i = 0; i <= 0x10FFFF; ++i) + { + const Lisp_Object code = make_natnum (i); + const Lisp_Object name = call2 (get_property, code, Qname); + if (NILP (name)) continue; + CHECK_STRING (name); + length = max (length, SBYTES (name)); + Fputhash (name, code, character_names); + } + max_character_name_length = length; +} + /* Read a \-escape sequence, assuming we already read the `\'. If the escape sequence forces unibyte, return eight-bit char. */ @@ -2357,6 +2388,68 @@ read_escape (Lisp_Object readcharfun, bool stringp) return i; } + case 'N': + /* Named character. 
*/ + { + c = READCHAR; + if (c != '{') + invalid_syntax ("Expected opening brace after \\N"); + if (NILP (character_names)) + init_character_names (); + USE_SAFE_ALLOCA; + char *name = SAFE_ALLOCA (max_character_name_length + 1); + bool whitespace = false; + ptrdiff_t length = 0; + while (true) + { + c = READCHAR; + if (c < 0) + end_of_file_error (); + if (c == '}') + break; + if (! c_isascii (c)) + xsignal1 (Qinvalid_read_syntax, + CALLN (Fformat, + build_pure_c_string ("Non-ASCII character U+%04X" + " in character name"), + make_natnum (c))); + /* We treat multiple adjacent whitespace characters as a + single space character. This makes it easier to use + character names in e.g. multi-line strings. */ + if (c_isspace (c)) + { + if (! whitespace) + { + whitespace = true; + name[length++] = ' '; + } + } + else + { + whitespace = false; + name[length++] = c; + } + if (length >= max_character_name_length) + invalid_syntax ("Character name too long"); + } + if (length == 0) + invalid_syntax ("Empty character name"); + name[length] = 0; + const Lisp_Object lisp_name = make_unibyte_string (name, length); + const Lisp_Object code = + (length >= 3 && length <= 10 && name[0] == 'U' && name[1] == '+') ? + /* Code point as U+N, where N is between 1 and 8 hexadecimal + digits. */ + string_to_number (name + 2, 16, false) : + Fgethash (lisp_name, character_names, Qnil); + SAFE_FREE (); + if (! RANGED_INTEGERP (0, code, 0x10FFFF)) + xsignal1 (Qinvalid_read_syntax, + CALLN (Fformat, + build_pure_c_string ("\\N{%s}"), lisp_name)); + return XINT (code); + } + default: return c; } @@ -4745,4 +4838,7 @@ that are loaded before your customizations are read! 
*/); DEFSYM (Qweakness, "weakness"); DEFSYM (Qrehash_size, "rehash-size"); DEFSYM (Qrehash_threshold, "rehash-threshold"); + + character_names = Qnil; + staticpro (&character_names); } diff --git a/test/src/lread-tests.el b/test/src/lread-tests.el new file mode 100644 index 0000000..1f87334 --- /dev/null +++ b/test/src/lread-tests.el @@ -0,0 +1,54 @@ +;;; lread-tests.el --- tests for lread.c -*- lexical-binding: t; -*- + +;; Copyright (C) 2016 Google Inc. + +;; Author: Philipp Stephani <phst@google.com> + +;; This file is part of GNU Emacs. + +;; This program is free software; you can redistribute it and/or modify +;; it under the terms of the GNU General Public License as published by +;; the Free Software Foundation, either version 3 of the License, or +;; (at your option) any later version. + +;; This program is distributed in the hope that it will be useful, +;; but WITHOUT ANY WARRANTY; without even the implied warranty of +;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +;; GNU General Public License for more details. + +;; You should have received a copy of the GNU General Public License +;; along with this program. If not, see <http://www.gnu.org/licenses/>. + +;;; Commentary: + +;; Unit tests for code in src/lread.c. 
+
+;;; Code:
+
+(ert-deftest lread-char-number ()
+  (should (equal ?\N{U+A817} #xA817)))
+
+(ert-deftest lread-char-name ()
+  (should (equal ?\N{SYLOTI NAGRI LETTER
+                 DHO}
+                 #xA817)))
+
+(ert-deftest lread-char-invalid-number ()
+  (should-error (read "?\\N{U+110000}") :type 'invalid-read-syntax))
+
+(ert-deftest lread-char-invalid-name ()
+  (should-error (read "?\\N{DOES NOT EXIST}") :type 'invalid-read-syntax))
+
+(ert-deftest lread-char-non-ascii-name ()
+  (should-error (read "?\\N{LATIN CAPITAL LETTER Ø}") :type 'invalid-read-syntax))
+
+(ert-deftest lread-char-empty-name ()
+  (should-error (read "?\\N{}") :type 'invalid-read-syntax))
+
+(ert-deftest lread-string-char-number ()
+  (should (equal "a\N{U+A817}b" "a\uA817b")))
+
+(ert-deftest lread-string-char-name ()
+  (should (equal "a\N{SYLOTI NAGRI LETTER DHO}b" "a\uA817b")))
+
+;;; lread-tests.el ends here
-- 
2.7.0
* Re: Character literals for Unicode (control) characters 2016-03-06 15:24 ` Philipp Stephani @ 2016-03-06 15:54 ` Eli Zaretskii 2016-03-06 17:35 ` Philipp Stephani 0 siblings, 1 reply; 47+ messages in thread From: Eli Zaretskii @ 2016-03-06 15:54 UTC (permalink / raw) To: Philipp Stephani; +Cc: larsi, johnw, emacs-devel, eggert > From: Philipp Stephani <p.stephani2@gmail.com> > Date: Sun, 06 Mar 2016 15:24:47 +0000 > > I've attached a patch with an initial implementation. Thanks. > +/* Hash table that maps Unicode character names to code points. */ > +static Lisp_Object character_names; > + > +/* Length of the longest Unicode character name, in bytes. */ > +static ptrdiff_t max_character_name_length; > + > +/* Initializes `character_names' and `max_character_name_length'. > + Called by `read_escape'. */ I wonder if there's a better way, in particular with a smaller memory footprint. Doesn't map-char-table work well enough to avoid generating all the names up front? > + if (! RANGED_INTEGERP (0, code, 0x10FFFF)) This should use MAX_UNICODE_CHAR. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Character literals for Unicode (control) characters 2016-03-06 15:54 ` Eli Zaretskii @ 2016-03-06 17:35 ` Philipp Stephani 2016-03-06 18:08 ` Paul Eggert 0 siblings, 1 reply; 47+ messages in thread From: Philipp Stephani @ 2016-03-06 17:35 UTC (permalink / raw) To: Eli Zaretskii; +Cc: eggert, larsi, johnw, emacs-devel [-- Attachment #1.1: Type: text/plain, Size: 1105 bytes --] Eli Zaretskii <eliz@gnu.org> schrieb am So., 6. März 2016 um 16:54 Uhr: > > From: Philipp Stephani <p.stephani2@gmail.com> > > Date: Sun, 06 Mar 2016 15:24:47 +0000 > > > > I've attached a patch with an initial implementation. > > Thanks. > > > +/* Hash table that maps Unicode character names to code points. */ > > +static Lisp_Object character_names; > > + > > +/* Length of the longest Unicode character name, in bytes. */ > > +static ptrdiff_t max_character_name_length; > > + > > +/* Initializes `character_names' and `max_character_name_length'. > > + Called by `read_escape'. */ > > I wonder if there's a better way, in particular with a smaller memory > footprint. Doesn't map-char-table work well enough to avoid > generating all the names up front? > It doesn't seem to work; for some reason the Unicode name table appears very small (only 136 code points) when map-char-table is called from C and lacks most characters. > > > + if (! RANGED_INTEGERP (0, code, 0x10FFFF)) > > This should use MAX_UNICODE_CHAR. > > Done, attached a new patch. [-- Attachment #1.2: Type: text/html, Size: 1701 bytes --] [-- Attachment #2: 0001-Implement-named-character-escapes-similar-to-Perl.patch --] [-- Type: application/octet-stream, Size: 7130 bytes --] From 22e299cd23a72a072461befa30a04bf557aecac8 Mon Sep 17 00:00:00 2001 From: Philipp Stephani <phst@google.com> Date: Sun, 6 Mar 2016 16:16:29 +0100 Subject: [PATCH] Implement named character escapes, similar to Perl * lread.c (init_character_names): New function. (read_escape): Read Perl-style named character escape sequences. 
(syms_of_lread): Initialize new variable `character_names'. * test/src/lread-tests.el (lread-char-empty-name): Add test file for src/lread.c. --- src/lread.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++ test/src/lread-tests.el | 54 ++++++++++++++++++++++++++++ 2 files changed, 150 insertions(+) create mode 100644 test/src/lread-tests.el diff --git a/src/lread.c b/src/lread.c index 25e3ff0..6e84fc8 100644 --- a/src/lread.c +++ b/src/lread.c @@ -43,6 +43,7 @@ along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>. */ #include "systime.h" #include "termhooks.h" #include "blockinput.h" +#include <c-ctype.h> #ifdef MSDOS #include "msdos.h" @@ -2150,6 +2151,36 @@ grow_read_buffer (void) MAX_MULTIBYTE_LENGTH, -1, 1); } +/* Hash table that maps Unicode character names to code points. */ +static Lisp_Object character_names; + +/* Length of the longest Unicode character name, in bytes. */ +static ptrdiff_t max_character_name_length; + +/* Initializes `character_names' and `max_character_name_length'. + Called by `read_escape'. */ +void init_character_names () +{ + character_names = CALLN (Fmake_hash_table, + QCtest, Qequal, + /* Currently around 100,000 Unicode + characters are defined. */ + QCsize, make_natnum (100000)); + const Lisp_Object get_property = + Fsymbol_function (intern_c_string ("get-char-code-property")); + ptrdiff_t length = 0; + for (int i = 0; i <= MAX_UNICODE_CHAR; ++i) + { + const Lisp_Object code = make_natnum (i); + const Lisp_Object name = call2 (get_property, code, Qname); + if (NILP (name)) continue; + CHECK_STRING (name); + length = max (length, SBYTES (name)); + Fputhash (name, code, character_names); + } + max_character_name_length = length; +} + /* Read a \-escape sequence, assuming we already read the `\'. If the escape sequence forces unibyte, return eight-bit char. */ @@ -2357,6 +2388,68 @@ read_escape (Lisp_Object readcharfun, bool stringp) return i; } + case 'N': + /* Named character. 
*/ + { + c = READCHAR; + if (c != '{') + invalid_syntax ("Expected opening brace after \\N"); + if (NILP (character_names)) + init_character_names (); + USE_SAFE_ALLOCA; + char *name = SAFE_ALLOCA (max_character_name_length + 1); + bool whitespace = false; + ptrdiff_t length = 0; + while (true) + { + c = READCHAR; + if (c < 0) + end_of_file_error (); + if (c == '}') + break; + if (! c_isascii (c)) + xsignal1 (Qinvalid_read_syntax, + CALLN (Fformat, + build_pure_c_string ("Non-ASCII character U+%04X" + " in character name"), + make_natnum (c))); + /* We treat multiple adjacent whitespace characters as a + single space character. This makes it easier to use + character names in e.g. multi-line strings. */ + if (c_isspace (c)) + { + if (! whitespace) + { + whitespace = true; + name[length++] = ' '; + } + } + else + { + whitespace = false; + name[length++] = c; + } + if (length >= max_character_name_length) + invalid_syntax ("Character name too long"); + } + if (length == 0) + invalid_syntax ("Empty character name"); + name[length] = 0; + const Lisp_Object lisp_name = make_unibyte_string (name, length); + const Lisp_Object code = + (length >= 3 && length <= 10 && name[0] == 'U' && name[1] == '+') ? + /* Code point as U+N, where N is between 1 and 8 hexadecimal + digits. */ + string_to_number (name + 2, 16, false) : + Fgethash (lisp_name, character_names, Qnil); + SAFE_FREE (); + if (! RANGED_INTEGERP (0, code, MAX_UNICODE_CHAR)) + xsignal1 (Qinvalid_read_syntax, + CALLN (Fformat, + build_pure_c_string ("\\N{%s}"), lisp_name)); + return XINT (code); + } + default: return c; } @@ -4745,4 +4838,7 @@ that are loaded before your customizations are read! 
*/); DEFSYM (Qweakness, "weakness"); DEFSYM (Qrehash_size, "rehash-size"); DEFSYM (Qrehash_threshold, "rehash-threshold"); + + character_names = Qnil; + staticpro (&character_names); } diff --git a/test/src/lread-tests.el b/test/src/lread-tests.el new file mode 100644 index 0000000..1f87334 --- /dev/null +++ b/test/src/lread-tests.el @@ -0,0 +1,54 @@ +;;; lread-tests.el --- tests for lread.c -*- lexical-binding: t; -*- + +;; Copyright (C) 2016 Google Inc. + +;; Author: Philipp Stephani <phst@google.com> + +;; This file is part of GNU Emacs. + +;; This program is free software; you can redistribute it and/or modify +;; it under the terms of the GNU General Public License as published by +;; the Free Software Foundation, either version 3 of the License, or +;; (at your option) any later version. + +;; This program is distributed in the hope that it will be useful, +;; but WITHOUT ANY WARRANTY; without even the implied warranty of +;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +;; GNU General Public License for more details. + +;; You should have received a copy of the GNU General Public License +;; along with this program. If not, see <http://www.gnu.org/licenses/>. + +;;; Commentary: + +;; Unit tests for code in src/lread.c. 
+
+;;; Code:
+
+(ert-deftest lread-char-number ()
+  (should (equal ?\N{U+A817} #xA817)))
+
+(ert-deftest lread-char-name ()
+  (should (equal ?\N{SYLOTI NAGRI LETTER
+                 DHO}
+                 #xA817)))
+
+(ert-deftest lread-char-invalid-number ()
+  (should-error (read "?\\N{U+110000}") :type 'invalid-read-syntax))
+
+(ert-deftest lread-char-invalid-name ()
+  (should-error (read "?\\N{DOES NOT EXIST}") :type 'invalid-read-syntax))
+
+(ert-deftest lread-char-non-ascii-name ()
+  (should-error (read "?\\N{LATIN CAPITAL LETTER Ø}") :type 'invalid-read-syntax))
+
+(ert-deftest lread-char-empty-name ()
+  (should-error (read "?\\N{}") :type 'invalid-read-syntax))
+
+(ert-deftest lread-string-char-number ()
+  (should (equal "a\N{U+A817}b" "a\uA817b")))
+
+(ert-deftest lread-string-char-name ()
+  (should (equal "a\N{SYLOTI NAGRI LETTER DHO}b" "a\uA817b")))
+
+;;; lread-tests.el ends here
-- 
2.7.0
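The \N{U+code} branch of the patch above can be tried outside Emacs.
What follows is a minimal standalone sketch, not the patch's code.  It
assumes strtoul in place of Emacs's string_to_number, and it rejects
non-hexadecimal characters up front, which may be stricter than what
the patch does.  The names parse_code_point and MAX_CODE_POINT are
hypothetical; the latter stands in for MAX_UNICODE_CHAR.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for Emacs's MAX_UNICODE_CHAR.  */
#define MAX_CODE_POINT 0x10FFFFUL

/* Return the code point written as "U+" followed by one to eight
   hexadecimal digits, or -1 if NAME does not have that form or the
   value lies beyond the Unicode range.  */
long
parse_code_point (const char *name)
{
  assert (name != NULL);
  size_t length = strlen (name);

  /* Same length window as the patch: "U+" plus 1 to 8 digits.  */
  if (length < 3 || length > 10 || name[0] != 'U' || name[1] != '+')
    return -1;

  /* Assumption of this sketch: every remaining character must be a
     hexadecimal digit, so trailing garbage is an error.  */
  if (strspn (name + 2, "0123456789abcdefABCDEF") != length - 2)
    return -1;

  /* At most eight hex digits, so the value always fits in an
     unsigned long (at least 32 bits wide).  */
  unsigned long code = strtoul (name + 2, NULL, 16);
  return code <= MAX_CODE_POINT ? (long) code : -1;
}
```

With this sketch, parse_code_point ("U+A817") yields 0xA817, while
"U+110000" is out of range and yields -1, which mirrors the
lread-char-number and lread-char-invalid-number tests.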
* Re: Character literals for Unicode (control) characters 2016-03-06 17:35 ` Philipp Stephani @ 2016-03-06 18:08 ` Paul Eggert 2016-03-06 18:28 ` Philipp Stephani 0 siblings, 1 reply; 47+ messages in thread From: Paul Eggert @ 2016-03-06 18:08 UTC (permalink / raw) To: Philipp Stephani, Eli Zaretskii; +Cc: larsi, johnw, emacs-devel Thanks for taking this on. Some comments: Why the hash table? Existing Lisp code dealing with Unicode names uses an alist, and it seems to do OK. If a hash table is needed, a hash table should also be used by the existing code elsewhere that does something similar. See the function ucs-names and its callers. If a hash table is needed, I suggest using a perfect hashing function (generated by gperf) and checking its results with get-char-code-property. That avoids the runtime overhead of initialization. It needs documentation, both in the Emacs Lisp manual and in NEWS. > +void init_character_names () > +{ The usual style is: void init_character_names (void) { No need for "const" for local variables (cost exceeds benefit). > if (c_isspace (c)) > { > if (! whitespace) > { > whitespace = true; > name[length++] = ' '; > } > } > else > { > whitespace = false; > name[length++] = c; > } This would be a bit easier to follow (and most likely a tiny bit more efficient) as something like this: > bool ws = c_isspace (c); > if (ws) > { > length -= whitespace; > c = ' '; > } > whitespace = ws; > name[length++] = c; ^ permalink raw reply [flat|nested] 47+ messages in thread
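The two loop bodies discussed in the review above can be compared in
isolation.  This is a minimal standalone sketch under stated
assumptions: it uses the standard isspace instead of gnulib's
c_isspace, reads from a NUL-terminated string instead of READCHAR, and
the function names collapse_patch and collapse_review are hypothetical.
In both functions OUT must have room for strlen (IN) + 1 bytes.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Loop body as in the patch: the first character of a whitespace run
   emits one space, the rest of the run is skipped.  */
size_t
collapse_patch (const char *in, char *out)
{
  assert (in != NULL && out != NULL);
  bool whitespace = false;
  size_t length = 0;
  for (; *in != '\0'; in++)
    {
      unsigned char c = (unsigned char) *in;
      if (isspace (c))
        {
          if (! whitespace)
            {
              whitespace = true;
              out[length++] = ' ';
            }
        }
      else
        {
          whitespace = false;
          out[length++] = (char) c;
        }
    }
  out[length] = '\0';
  return length;
}

/* Loop body as suggested in the review: every whitespace character is
   stored as a space, but if the previous character was whitespace too,
   LENGTH first steps back so the new space overwrites the old one.  */
size_t
collapse_review (const char *in, char *out)
{
  assert (in != NULL && out != NULL);
  bool whitespace = false;
  size_t length = 0;
  for (; *in != '\0'; in++)
    {
      unsigned char c = (unsigned char) *in;
      bool ws = isspace (c) != 0;
      if (ws)
        {
          length -= whitespace;
          c = ' ';
        }
      whitespace = ws;
      out[length++] = (char) c;
    }
  out[length] = '\0';
  return length;
}
```

Both variants turn "SYLOTI NAGRI LETTER" followed by a newline, some
indentation and "DHO" into the single-spaced official name.  The review
variant is shorter, at the cost of letting length decrease.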
* Re: Character literals for Unicode (control) characters
From: Philipp Stephani @ 2016-03-06 18:28 UTC
To: Paul Eggert, Eli Zaretskii; +Cc: larsi, johnw, emacs-devel

Paul Eggert <eggert@cs.ucla.edu> wrote on Sun, 6 March 2016 at 19:08:

> Thanks for taking this on. Some comments:
>
> Why the hash table? Existing Lisp code dealing with Unicode names uses
> an alist, and it seems to do OK.

Hash tables are as easy to use as alists, but have average O(1) lookup
time, as opposed to O(n) time for alists. Also alists are more prone to
cache misses because they are less contiguous.

> If a hash table is needed, a hash table should also be used by the
> existing code elsewhere that does something similar. See the function
> ucs-names and its callers.

Initially I used ucs-names, but then decided against it because it
lacks most characters. That's OK for a table used for completion, but
for input all characters should be present. So the use cases are
different.

> If a hash table is needed, I suggest using a perfect hashing function
> (generated by gperf) and checking its results with
> get-char-code-property. That avoids the runtime overhead of
> initialization.

Sounds good, but that would require much more effort and would delay
this project unnecessarily. It can be done later once the basic
functionality is in place.

> It needs documentation, both in the Emacs Lisp manual and in NEWS.

Yes, I've attached a patch.

> > +void init_character_names ()
> > +{
>
> The usual style is:
>
> void
> init_character_names (void)
> {
>
> No need for "const" for local variables (cost exceeds benefit).

Removed.

> > if (c_isspace (c))
> >   {
> >     if (!
whitespace) > > { > > whitespace = true; > > name[length++] = ' '; > > } > > } > > else > > { > > whitespace = false; > > name[length++] = c; > > } > > This would be a bit easier to follow (and most likely a tiny bit more > efficient) > as something like this: > > > bool ws = c_isspace (c); > > if (ws) > > { > > length -= whitespace; > > c = ' '; > > } > > whitespace = ws; > > name[length++] = c; > > I'd rather not have length decrease. Moved out the assignment, though. [-- Attachment #1.2: Type: text/html, Size: 3746 bytes --] [-- Attachment #2: 0002-Add-documentation-for-character-name-escapes.patch --] [-- Type: application/octet-stream, Size: 2348 bytes --] From d0d5219a358a2d8e853f1ce11cf16fb2629697c6 Mon Sep 17 00:00:00 2001 From: Philipp Stephani <phst@google.com> Date: Sun, 6 Mar 2016 19:07:06 +0100 Subject: [PATCH 2/2] Add documentation for character name escapes --- doc/lispref/nonascii.texi | 2 +- doc/lispref/objects.texi | 10 ++++++++++ etc/NEWS | 5 +++++ 3 files changed, 16 insertions(+), 1 deletion(-) diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi index 9cf3b57..66ad9ac 100644 --- a/doc/lispref/nonascii.texi +++ b/doc/lispref/nonascii.texi @@ -633,7 +633,7 @@ Character Properties @end group @group ;; U+2163 ROMAN NUMERAL FOUR -(get-char-code-property ?\u2163 'numeric-value) +(get-char-code-property ?\N@{ROMAN NUMERAL FOUR@} 'numeric-value) @result{} 4 @end group @group diff --git a/doc/lispref/objects.texi b/doc/lispref/objects.texi index 3245930..96b334d 100644 --- a/doc/lispref/objects.texi +++ b/doc/lispref/objects.texi @@ -387,6 +387,16 @@ General Escape Syntax for the character @kbd{C-b}. Only characters up to octal code 777 can be specified this way. + Fourthly, you can specify characters by their name. A character +name escape sequence consists of a backslash, @samp{N@{}, the Unicode +character name, and @samp{@}}. 
Alternatively, you can also put the +numeric code point value between the braces, using the syntax +@samp{\N@{U+nnnn@}}, where @samp{nnnn} denotes between one and eight +hexadecimal digits. Thus, @samp{?\N@{LATIN CAPITAL LETTER A@}} and +@samp{?\N@{U+41@}} both denote the character @kbd{A}. To simplify +entering multi-line strings, you can replace spaces in the character +names by arbitrary non-empty sequence of whitespace (e.g., newlines). + These escape sequences may also be used in strings. @xref{Non-ASCII in Strings}. diff --git a/etc/NEWS b/etc/NEWS index 92d69d2..9c77474 100644 --- a/etc/NEWS +++ b/etc/NEWS @@ -159,6 +159,11 @@ that negotiation should complete even on non-blocking sockets. `window-pixel-height-before-size-change' allow to detect which window changed size when `window-size-change-functions' are run. ++++ +** Emacs now supports character name escape sequences in character and +string literals. The syntax variants \N{character name} and +\N{U+code} are supported. + \f * Changes in Emacs 25.2 on Non-Free Operating Systems -- 2.7.0 [-- Attachment #3: 0003-Minor-cleanups-for-character-name-escapes.patch --] [-- Type: application/octet-stream, Size: 2923 bytes --] From 30e6d9dd4e83a36fe07bbeae678b3f086773346e Mon Sep 17 00:00:00 2001 From: Philipp Stephani <phst@google.com> Date: Sun, 6 Mar 2016 19:27:21 +0100 Subject: [PATCH 3/3] Minor cleanups for character name escapes. * lread.c (init_character_names): Add missing `void'. Remove top-level `const'. (read_escape): Simplify loop a bit. Remove top-level `const'. --- src/lread.c | 27 ++++++++++++--------------- 1 file changed, 12 insertions(+), 15 deletions(-) diff --git a/src/lread.c b/src/lread.c index 6e84fc8..4000637 100644 --- a/src/lread.c +++ b/src/lread.c @@ -2159,20 +2159,20 @@ static ptrdiff_t max_character_name_length; /* Initializes `character_names' and `max_character_name_length'. Called by `read_escape'. 
*/ -void init_character_names () +void init_character_names (void) { character_names = CALLN (Fmake_hash_table, QCtest, Qequal, /* Currently around 100,000 Unicode characters are defined. */ QCsize, make_natnum (100000)); - const Lisp_Object get_property = + Lisp_Object get_property = Fsymbol_function (intern_c_string ("get-char-code-property")); ptrdiff_t length = 0; for (int i = 0; i <= MAX_UNICODE_CHAR; ++i) { - const Lisp_Object code = make_natnum (i); - const Lisp_Object name = call2 (get_property, code, Qname); + Lisp_Object code = make_natnum (i); + Lisp_Object name = call2 (get_property, code, Qname); if (NILP (name)) continue; CHECK_STRING (name); length = max (length, SBYTES (name)); @@ -2418,25 +2418,22 @@ read_escape (Lisp_Object readcharfun, bool stringp) character names in e.g. multi-line strings. */ if (c_isspace (c)) { - if (! whitespace) - { - whitespace = true; - name[length++] = ' '; - } + if (whitespace) + continue; + c = ' '; + whitespace = true; } else - { - whitespace = false; - name[length++] = c; - } + whitespace = false; + name[length++] = c; if (length >= max_character_name_length) invalid_syntax ("Character name too long"); } if (length == 0) invalid_syntax ("Empty character name"); name[length] = 0; - const Lisp_Object lisp_name = make_unibyte_string (name, length); - const Lisp_Object code = + Lisp_Object lisp_name = make_unibyte_string (name, length); + Lisp_Object code = (length >= 3 && length <= 10 && name[0] == 'U' && name[1] == '+') ? /* Code point as U+N, where N is between 1 and 8 hexadecimal digits. */ -- 2.7.0 ^ permalink raw reply related [flat|nested] 47+ messages in thread
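The whitespace handling discussed above can be tried outside Emacs. What follows is only a minimal standalone sketch of the loop shape used in the cleanup patch (skip repeated whitespace, store a single space), not the lread.c code itself: the function name collapse_whitespace and its buffer interface are hypothetical, and the standard isspace stands in for gnulib's c_isspace.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of lread.c.  Copy IN to OUT, replacing
   each run of whitespace with a single space.  OUT has room for
   OUT_SIZE bytes including the terminating null.  Return the length
   of the result, or -1 if it does not fit.  */
static ptrdiff_t
collapse_whitespace (const char *in, char *out, size_t out_size)
{
  if (out_size == 0)
    return -1;
  bool whitespace = false;
  size_t length = 0;
  for (; *in; in++)
    {
      unsigned char c = *in;
      if (isspace (c))
        {
          /* Already inside a whitespace run: drop this character.  */
          if (whitespace)
            continue;
          c = ' ';
          whitespace = true;
        }
      else
        whitespace = false;
      /* Keep one byte free for the terminating null.  */
      if (out_size <= length + 1)
        return -1;
      out[length++] = c;
    }
  out[length] = '\0';
  return length;
}
```

As in the patch, the length never decreases, and a leading whitespace run still produces one space; whether that is desirable for character names is left open here.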
* Re: Character literals for Unicode (control) characters 2016-03-06 18:28 ` Philipp Stephani @ 2016-03-06 19:03 ` Paul Eggert 2016-03-06 19:16 ` Philipp Stephani 0 siblings, 1 reply; 47+ messages in thread From: Paul Eggert @ 2016-03-06 19:03 UTC (permalink / raw) To: Philipp Stephani, Eli Zaretskii; +Cc: larsi, johnw, emacs-devel Philipp Stephani wrote: > Initially I used ucs-names, but then decided against it because it lacks > most characters. Can you describe in general terms the difference between what's in ucs-names and what's in the new hash table? Should the two things be unified?
* Re: Character literals for Unicode (control) characters 2016-03-06 19:03 ` Paul Eggert @ 2016-03-06 19:16 ` Philipp Stephani 2016-03-06 20:05 ` Eli Zaretskii 0 siblings, 1 reply; 47+ messages in thread From: Philipp Stephani @ 2016-03-06 19:16 UTC (permalink / raw) To: Paul Eggert, Eli Zaretskii; +Cc: larsi, johnw, emacs-devel

Paul Eggert <eggert@cs.ucla.edu> wrote on Sun, 6 Mar 2016 at 20:03:

> Philipp Stephani wrote:
> > Initially I used ucs-names, but then decided against it because it lacks
> > most characters.
>
> Can you describe in general terms the difference between what's in
> ucs-names and what's in the new hash table? Should the two things be unified?

ucs-names uses a whitelist of ranges to consider:

'((#x0000 . #x33FF)
  ;; (#x3400 . #x4DBF) CJK Ideographs Extension A
  (#x4DC0 . #x4DFF)
  ;; (#x4E00 . #x9FFF) CJK Unified Ideographs
  (#xA000 . #xD7FF)
  ;; (#xD800 . #xFAFF) Surrogate/Private
  (#xFB00 . #x134FF)
  ;; (#x13500 . #x167FF) unused
  (#x16800 . #x16A3F)
  ;; (#x16A40 . #x1AFFF) unused
  (#x1B000 . #x1B0FF)
  ;; (#x1B100 . #x1CFFF) unused
  (#x1D000 . #x1FFFF)
  ;; (#x20000 . #xDFFFF) CJK Ideograph Extension A, B, etc, unused
  (#xE0000 . #xE01FF))

This is probably for practical purposes (no point in showing thousands of "CJK UNIFIED IDEOGRAPH-xyz" completions). For a character escape these considerations don't apply, and it would be very surprising and confusing to not accept all characters.
* Re: Character literals for Unicode (control) characters 2016-03-06 19:16 ` Philipp Stephani @ 2016-03-06 20:05 ` Eli Zaretskii 2016-03-13 20:31 ` Philipp Stephani 0 siblings, 1 reply; 47+ messages in thread From: Eli Zaretskii @ 2016-03-06 20:05 UTC (permalink / raw) To: Philipp Stephani; +Cc: larsi, eggert, johnw, emacs-devel > From: Philipp Stephani <p.stephani2@gmail.com> > Date: Sun, 06 Mar 2016 19:16:37 +0000 > Cc: larsi@gnus.org, johnw@gnu.org, emacs-devel@gnu.org > > This is probably for practical purposes (no point in showing thousands of "CJK UNIFIED IDEOGRAPH-xyz" > completions). For a character escape these considerations don't apply, and it would be very surprising and > confusing to not accept all characters. The only characters that ucs-names omits are CJK ideographs, whose codepoints can be computed from the name algorithmically. All the others are non-characters, right? So why it won't be a good idea to simply use ucs-names? ^ permalink raw reply [flat|nested] 47+ messages in thread
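Computing a code point "from the name algorithmically" amounts to stripping a fixed prefix and reading the rest as hexadecimal. A minimal standalone sketch of that step, assuming plain C strings instead of Lisp objects: hex_after_prefix is a hypothetical name, and the sketch accepts only hexadecimal digits, which may be stricter than whatever number parser the real reader ends up using.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not an Emacs function.  If NAME starts with
   PREFIX and the remainder consists of one to eight hexadecimal
   digits, return their value; otherwise return -1.  */
static long long
hex_after_prefix (const char *name, const char *prefix)
{
  size_t prefix_len = strlen (prefix);
  size_t name_len = strlen (name);
  if (! (prefix_len < name_len && name_len <= prefix_len + 8))
    return -1;
  if (memcmp (name, prefix, prefix_len) != 0)
    return -1;
  long long value = 0;
  for (const char *p = name + prefix_len; *p; p++)
    {
      int digit;
      if ('0' <= *p && *p <= '9')
        digit = *p - '0';
      else if ('A' <= *p && *p <= 'F')
        digit = *p - 'A' + 10;
      else if ('a' <= *p && *p <= 'f')
        digit = *p - 'a' + 10;
      else
        return -1;		/* Not a hexadecimal digit.  */
      value = value * 16 + digit;
    }
  return value;
}
```

The same helper would serve both a "U+" prefix and an ideograph-name prefix; the caller still has to check that the result falls in a range where such names are actually assigned.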
* Re: Character literals for Unicode (control) characters 2016-03-06 20:05 ` Eli Zaretskii @ 2016-03-13 20:31 ` Philipp Stephani 2016-03-14 20:03 ` Paul Eggert 0 siblings, 1 reply; 47+ messages in thread From: Philipp Stephani @ 2016-03-13 20:31 UTC (permalink / raw) To: Eli Zaretskii; +Cc: larsi, eggert, johnw, emacs-devel Eli Zaretskii <eliz@gnu.org> wrote on Sun, 6 Mar 2016 at 21:05: > > From: Philipp Stephani <p.stephani2@gmail.com> > > Date: Sun, 06 Mar 2016 19:16:37 +0000 > > Cc: larsi@gnus.org, johnw@gnu.org, emacs-devel@gnu.org > > > > This is probably for practical purposes (no point in showing thousands > of "CJK UNIFIED IDEOGRAPH-xyz" > > completions). For a character escape these considerations don't apply, > and it would be very surprising and > > confusing to not accept all characters. > > The only characters that ucs-names omits are CJK ideographs, whose > codepoints can be computed from the name algorithmically. All the > others are non-characters, right? So why it won't be a good idea to > simply use ucs-names? > > I've attached another patch to switch to ucs-names. [-- Attachment #2: 0004-Use-ucs-names.patch --] [-- Type: application/octet-stream, Size: 8172 bytes --] From 0481b16cdcd7c2b4c1a877f8a01e569ced99d1ac Mon Sep 17 00:00:00 2001 From: Philipp Stephani <phst@google.com> Date: Sun, 13 Mar 2016 21:27:30 +0100 Subject: [PATCH 4/4] Use `ucs-names'. * lread.c (invalid_character_name, check_scalar_value) (parse_code_after_prefix, character_name_to_code): New helper functions that use `ucs-names' and parsing for CJK ideographs. (read_escape): Use helper functions. (syms_of_lread): New symbol `ucs-names'. * test/src/lread-tests.el: New tests; fix a couple of bugs in existing tests.
--- src/lread.c | 122 +++++++++++++++++++++++++++++++----------------- test/src/lread-tests.el | 11 ++++- 2 files changed, 87 insertions(+), 46 deletions(-) diff --git a/src/lread.c b/src/lread.c index 4000637..567c071 100644 --- a/src/lread.c +++ b/src/lread.c @@ -44,6 +44,8 @@ along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>. */ #include "termhooks.h" #include "blockinput.h" #include <c-ctype.h> +#include <string.h> +#include <stdnoreturn.h> #ifdef MSDOS #include "msdos.h" @@ -2151,34 +2153,81 @@ grow_read_buffer (void) MAX_MULTIBYTE_LENGTH, -1, 1); } -/* Hash table that maps Unicode character names to code points. */ -static Lisp_Object character_names; +/* Signals an `invalid-read-syntax' error indicating that the + character name in an \N{...} literal is invalid. */ +static noreturn void invalid_character_name (Lisp_Object name) +{ + xsignal1 (Qinvalid_read_syntax, + CALLN (Fformat, build_pure_c_string ("\\N{%s}"), name)); +} -/* Length of the longest Unicode character name, in bytes. */ -static ptrdiff_t max_character_name_length; +/* Checks that CODE is a valid Unicode scalar value, and returns its + value. CODE should be parsed from the character name given by + NAME. NAME is used for error messages. */ +static int check_scalar_value (Lisp_Object code, Lisp_Object name) +{ + if (! RANGED_INTEGERP (0, code, MAX_UNICODE_CHAR) || + /* Don't allow surrogates. */ + RANGED_INTEGERP (0xD800, code, 0xDFFF)) + invalid_character_name (name); + return XINT (code); +} -/* Initializes `character_names' and `max_character_name_length'. - Called by `read_escape'. */ -void init_character_names (void) +/* If NAME starts with PREFIX, interpret the rest as a hexadecimal + number and return its value. Raises `invalid-read-syntax' if the + number is not a valid scalar value. Returns -1 if NAME doesn't + start with PREFIX. 
*/ +static int +parse_code_after_prefix (Lisp_Object name, const char* prefix) { - character_names = CALLN (Fmake_hash_table, - QCtest, Qequal, - /* Currently around 100,000 Unicode - characters are defined. */ - QCsize, make_natnum (100000)); - Lisp_Object get_property = - Fsymbol_function (intern_c_string ("get-char-code-property")); - ptrdiff_t length = 0; - for (int i = 0; i <= MAX_UNICODE_CHAR; ++i) + ptrdiff_t name_len = SBYTES (name); + ptrdiff_t prefix_len = strlen (prefix); + /* Allow between one and eight hexadecimal digits after the + prefix. */ + if (name_len > prefix_len && name_len <= prefix_len + 8 + && memcmp (SDATA (name), prefix, prefix_len) == 0) { - Lisp_Object code = make_natnum (i); - Lisp_Object name = call2 (get_property, code, Qname); - if (NILP (name)) continue; - CHECK_STRING (name); - length = max (length, SBYTES (name)); - Fputhash (name, code, character_names); + Lisp_Object code = string_to_number (SDATA (name) + prefix_len, 16, false); + if (! NILP (code)) + return check_scalar_value (code, name); + } + return -1; +} + +/* Returns the scalar value that has the Unicode character name NAME. + Raises `invalid-read-syntax' if there is no such character. */ +static int +character_name_to_code (Lisp_Object name) +{ + /* Code point as U+N, where N is between 1 and 8 hexadecimal + digits. */ + int code = parse_code_after_prefix (name, "U+"); + if (code >= 0) + return code; + + /* CJK ideographs are not contained in the association list returned + by `ucs-names'. But they follow a predictable naming pattern: a + fixed prefix plus the hexadecimal codepoint value. */ + code = parse_code_after_prefix (name, "CJK IDEOGRAPH-"); + if (code >= 0) + { + /* Various ranges of CJK characters; see UnicodeData.txt. 
*/ + if ((code >= 0x3400 && code <= 0x4DB5) || + (code >= 0x4E00 && code <= 0x9FD5) || + (code >= 0x20000 && code <= 0x2A6D6) || + (code >= 0x2A700 && code <= 0x2B734) || + (code >= 0x2B740 && code <= 0x2B81D) || + (code >= 0x2B820 && code <= 0x2CEA1)) + return code; + else + invalid_character_name (name); } - max_character_name_length = length; + + /* Look up the name in the table returned by `ucs-names'. */ + Lisp_Object names = call0 (Qucs_names); + if (! CONSP (names)) + invalid_syntax ("Unicode character name database not loaded"); + return check_scalar_value (CDR (Fassoc (name, names)), name); } /* Read a \-escape sequence, assuming we already read the `\'. @@ -2394,10 +2443,9 @@ read_escape (Lisp_Object readcharfun, bool stringp) c = READCHAR; if (c != '{') invalid_syntax ("Expected opening brace after \\N"); - if (NILP (character_names)) - init_character_names (); - USE_SAFE_ALLOCA; - char *name = SAFE_ALLOCA (max_character_name_length + 1); + /* 200 characters is hopefully long enough. Increase if + not. */ + char name[200]; bool whitespace = false; ptrdiff_t length = 0; while (true) @@ -2426,25 +2474,12 @@ read_escape (Lisp_Object readcharfun, bool stringp) else whitespace = false; name[length++] = c; - if (length >= max_character_name_length) + if (length >= sizeof name) invalid_syntax ("Character name too long"); } if (length == 0) invalid_syntax ("Empty character name"); - name[length] = 0; - Lisp_Object lisp_name = make_unibyte_string (name, length); - Lisp_Object code = - (length >= 3 && length <= 10 && name[0] == 'U' && name[1] == '+') ? - /* Code point as U+N, where N is between 1 and 8 hexadecimal - digits. */ - string_to_number (name + 2, 16, false) : - Fgethash (lisp_name, character_names, Qnil); - SAFE_FREE (); - if (! 
RANGED_INTEGERP (0, code, MAX_UNICODE_CHAR)) - xsignal1 (Qinvalid_read_syntax, - CALLN (Fformat, - build_pure_c_string ("\\N{%s}"), lisp_name)); - return XINT (code); + return character_name_to_code (make_unibyte_string (name, length)); } default: @@ -4836,6 +4871,5 @@ that are loaded before your customizations are read! */); DEFSYM (Qrehash_size, "rehash-size"); DEFSYM (Qrehash_threshold, "rehash-threshold"); - character_names = Qnil; - staticpro (&character_names); + DEFSYM (Qucs_names, "ucs-names"); } diff --git a/test/src/lread-tests.el b/test/src/lread-tests.el index 1f87334..ff5d0f6 100644 --- a/test/src/lread-tests.el +++ b/test/src/lread-tests.el @@ -40,10 +40,17 @@ (should-error (read "?\\N{DOES NOT EXIST}")) :type 'invalid-read-syntax) (ert-deftest lread-char-non-ascii-name () - (should-error (read "?\\N{LATIN CAPITAL LETTER Ø}")) 'invalid-read-syntax) + (should-error (read "?\\N{LATIN CAPITAL LETTER Ø}") + :type 'invalid-read-syntax)) (ert-deftest lread-char-empty-name () - (should-error (read "?\\N{}")) 'invalid-read-syntax) + (should-error (read "?\\N{}") :type 'invalid-read-syntax)) + +(ert-deftest lread-char-cjk-name () + (should (equal ?\N{CJK IDEOGRAPH-2B734} #x2B734))) + +(ert-deftest lread-char-invalid-cjk-name () + (should-error (read "?\\N{CJK IDEOGRAPH-2B735}") :type 'invalid-read-syntax)) (ert-deftest lread-string-char-number () (should (equal "a\N{U+A817}b" "a\uA817b"))) -- 2.7.0 ^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: Character literals for Unicode (control) characters 2016-03-13 20:31 ` Philipp Stephani @ 2016-03-14 20:03 ` Paul Eggert 2016-03-14 20:30 ` Eli Zaretskii ` (2 more replies) 0 siblings, 3 replies; 47+ messages in thread From: Paul Eggert @ 2016-03-14 20:03 UTC (permalink / raw) To: Philipp Stephani, Eli Zaretskii; +Cc: larsi, johnw, emacs-devel Thanks, here's a detailed low level review. > Subject: [PATCH 4/4] Use `ucs-names'. Summary lines like "Use `ucs-names'." should not end with "." and should be as informative as possible within a 50-char limit. > +#include <stdnoreturn.h> This include reportedly doesn't work well with Microsoft compilers. Omit it and use _Noreturn instead of noreturn. > +/* Signals an `invalid-read-syntax' error indicating that the > + character name in an \N{...} literal is invalid. */ Use active voice "Signal an" rather than a non-sentence. Don't use grave quoting in comments (no quoting needed here anyway). > +static noreturn void invalid_character_name (Lisp_Object name) Put "static _Noreturn void" on the first line, and the rest on the next line; that's the usual GNU style. > +/* Checks that CODE is a valid Unicode scalar value, and returns its > + value. CODE should be parsed from the character name given by > + NAME. NAME is used for error messages. */ Active voice: "Checks" -> "Check". > +static int check_scalar_value (Lisp_Object code, Lisp_Object name) "static int" in a separate line. > +{ > + if (! RANGED_INTEGERP (0, code, MAX_UNICODE_CHAR) || > + /* Don't allow surrogates. */ > + RANGED_INTEGERP (0xD800, code, 0xDFFF)) > + invalid_character_name (name); > + return XINT (code); > +} RANGED_INTEGERP implies two tests for integer. Better would be an explicit NUMBERP check, followed by an XINT, followed by C-language range checks. Just use <= or < in range checks (not >= or >). Also, don't put operators like || at the end of a line; put them at the start of the next line instead. 
> +/* If NAME starts with PREFIX, interpret the rest as a hexadecimal > + number and return its value. Raises `invalid-read-syntax' if the > + number is not a valid scalar value. Returns -1 if NAME doesn't > + start with PREFIX. */ Active voice. No need for grave quoting. > +static int > +parse_code_after_prefix (Lisp_Object name, const char* prefix) "char* x" -> "char *x" in GNU style. > + if (name_len > prefix_len && name_len <= prefix_len + 8 Just use < or <= for range checks. > + Lisp_Object code = string_to_number (SDATA (name) + prefix_len, > 16, false); > + if (! NILP (code)) > + return check_scalar_value (code, name); Why is nil treated differently from other invalid values (e.g., floating-point numbers)? They're all invalid character names, right? > > + /* Various ranges of CJK characters; see UnicodeData.txt. */ > + if ((code >= 0x3400 && code <= 0x4DB5) || > + (code >= 0x4E00 && code <= 0x9FD5) || > + (code >= 0x20000 && code <= 0x2A6D6) || > + (code >= 0x2A700 && code <= 0x2B734) || > + (code >= 0x2B740 && code <= 0x2B81D) || > + (code >= 0x2B820 && code <= 0x2CEA1)) > + return code; Use only <= here, and put || at the start of lines. What's the likelihood that the numbers in the above test will change? > > + if (! CONSP (names)) > + invalid_syntax ("Unicode character name database not loaded"); This test is not needed, as ucs-names always returns a cons, and anyway even if it didn't then Fassoc would do the right thing. > + /* 200 characters is hopefully long enough. Increase if > + not. */ > + char name[200]; Give a name to this constant, e.g., /* Bound on the length of a Unicode character name. As of Unicode 9.0.0 the maximum is 83, so this should be safe. */ enum { UNICODE_CHARACTER_NAME_LENGTH_BOUND = 199 }; ... char name[UNICODE_CHARACTER_NAME_LENGTH_BOUND + 1]; ^ permalink raw reply [flat|nested] 47+ messages in thread
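The range-check style asked for in this review (only < or <=, with || and && at the start of a line) can be illustrated on a plain integer. A minimal sketch, assuming the value has already been extracted from the Lisp object: is_scalar_value is a hypothetical name, and MAX_UNICODE_CHAR is defined locally as 0x10FFFF so the example compiles on its own.

```c
#include <stdbool.h>

/* Defined here only to make the sketch self-contained.  */
enum { MAX_UNICODE_CHAR = 0x10FFFF };

/* Hypothetical helper.  Return true if I is a Unicode scalar value,
   i.e., a code point that is not a surrogate.  */
static bool
is_scalar_value (long i)
{
  return (0 <= i && i <= MAX_UNICODE_CHAR
          /* Don't allow surrogates.  */
          && ! (0xD800 <= i && i <= 0xDFFF));
}
```

Both tests are applied to the unboxed integer, so the type of the original Lisp value no longer matters once this function is reached.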
* Re: Character literals for Unicode (control) characters 2016-03-14 20:03 ` Paul Eggert @ 2016-03-14 20:30 ` Eli Zaretskii 2016-03-15 11:09 ` Nikolai Weibull 2016-03-14 21:27 ` Clément Pit--Claudel 2016-03-19 16:27 ` Philipp Stephani 2 siblings, 1 reply; 47+ messages in thread From: Eli Zaretskii @ 2016-03-14 20:30 UTC (permalink / raw) To: Paul Eggert; +Cc: p.stephani2, johnw, larsi, emacs-devel > Cc: larsi@gnus.org, johnw@gnu.org, emacs-devel@gnu.org > From: Paul Eggert <eggert@cs.ucla.edu> > Date: Mon, 14 Mar 2016 13:03:38 -0700 > > What's the likelihood that the numbers in the above test will > change? Zero, given the UTC's stability policy. But note that Unicode 9.0.0 adds another range of Ideographs similar to CJK, their names begin with "TANGUT IDEOGRAPH-". > > + /* 200 characters is hopefully long enough. Increase if > > + not. */ > > + char name[200]; > > Give a name to this constant, e.g., > > /* Bound on the length of a Unicode character name. > As of Unicode 9.0.0 the maximum is 83, so this should be safe. */ > enum { UNICODE_CHARACTER_NAME_LENGTH_BOUND = 199 }; > ... > char name[UNICODE_CHARACTER_NAME_LENGTH_BOUND + 1]; Perhaps we should ask on the Unicode mailing list, I somehow remember seeing a mandatory limit on the length of a character's name. Thanks. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Character literals for Unicode (control) characters 2016-03-14 20:30 ` Eli Zaretskii @ 2016-03-15 11:09 ` Nikolai Weibull 2016-03-15 17:10 ` Eli Zaretskii 0 siblings, 1 reply; 47+ messages in thread From: Nikolai Weibull @ 2016-03-15 11:09 UTC (permalink / raw) To: Eli Zaretskii Cc: p.stephani2, Paul Eggert, Lars Ingebrigtsen, johnw, Emacs Developers On Mon, Mar 14, 2016 at 9:30 PM, Eli Zaretskii <eliz@gnu.org> wrote: >> From: Paul Eggert <eggert@cs.ucla.edu> >> /* Bound on the length of a Unicode character name. >> As of Unicode 9.0.0 the maximum is 83, so this should be safe. */ >> enum { UNICODE_CHARACTER_NAME_LENGTH_BOUND = 199 }; >> ... >> char name[UNICODE_CHARACTER_NAME_LENGTH_BOUND + 1]; > > Perhaps we should ask on the Unicode mailing list, I somehow remember > seeing a mandatory limit on the length of a character's name. No such limit is mentioned in section 4.8. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Character literals for Unicode (control) characters 2016-03-15 11:09 ` Nikolai Weibull @ 2016-03-15 17:10 ` Eli Zaretskii 2016-03-16 8:16 ` Nikolai Weibull 0 siblings, 1 reply; 47+ messages in thread From: Eli Zaretskii @ 2016-03-15 17:10 UTC (permalink / raw) To: Nikolai Weibull; +Cc: p.stephani2, eggert, larsi, johnw, emacs-devel > Date: Tue, 15 Mar 2016 12:09:50 +0100 > From: Nikolai Weibull <now@disu.se> > Cc: Paul Eggert <eggert@cs.ucla.edu>, p.stephani2@gmail.com, johnw@gnu.org, > Lars Ingebrigtsen <larsi@gnus.org>, Emacs Developers <emacs-devel@gnu.org> > > >> /* Bound on the length of a Unicode character name. > >> As of Unicode 9.0.0 the maximum is 83, so this should be safe. */ > >> enum { UNICODE_CHARACTER_NAME_LENGTH_BOUND = 199 }; > >> ... > >> char name[UNICODE_CHARACTER_NAME_LENGTH_BOUND + 1]; > > > > Perhaps we should ask on the Unicode mailing list, I somehow remember > > seeing a mandatory limit on the length of a character's name. > > No such limit is mentioned in section 4.8. Indeed, there is none. However, this old discussion: http://unicode.org/mail-arch/unicode-ml/Archives-Old/UML022/0845.html http://unicode.org/mail-arch/unicode-ml/Archives-Old/UML022/0872.html indicates that 128 should be good enough. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Character literals for Unicode (control) characters 2016-03-15 17:10 ` Eli Zaretskii @ 2016-03-16 8:16 ` Nikolai Weibull 0 siblings, 0 replies; 47+ messages in thread From: Nikolai Weibull @ 2016-03-16 8:16 UTC (permalink / raw) To: Eli Zaretskii Cc: Paul Eggert, johnw, Emacs Developers, Philipp Stephani, Lars Ingebrigtsen, Nikolai Weibull On Tue, Mar 15, 2016 at 6:10 PM, Eli Zaretskii <eliz@gnu.org> wrote: >> Date: Tue, 15 Mar 2016 12:09:50 +0100 >> From: Nikolai Weibull <now@disu.se> >> Cc: Paul Eggert <eggert@cs.ucla.edu>, p.stephani2@gmail.com, johnw@gnu.org, >> Lars Ingebrigtsen <larsi@gnus.org>, Emacs Developers <emacs-devel@gnu.org> >> >> >> /* Bound on the length of a Unicode character name. >> >> As of Unicode 9.0.0 the maximum is 83, so this should be safe. */ >> >> enum { UNICODE_CHARACTER_NAME_LENGTH_BOUND = 199 }; >> >> ... >> >> char name[UNICODE_CHARACTER_NAME_LENGTH_BOUND + 1]; >> > >> > Perhaps we should ask on the Unicode mailing list, I somehow remember >> > seeing a mandatory limit on the length of a character's name. >> >> No such limit is mentioned in section 4.8. > Indeed, there is none. However, this old discussion: > > http://unicode.org/mail-arch/unicode-ml/Archives-Old/UML022/0845.html > http://unicode.org/mail-arch/unicode-ml/Archives-Old/UML022/0872.html > > indicates that 128 should be good enough. Given that this has held true for 16 years, I suppose that that limit won’t have to be adjusted anytime soon :-). The shortest name is now “OX”, so that limit has, however, changed in the interim. ^ permalink raw reply [flat|nested] 47+ messages in thread
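Whatever bound is finally chosen, the point of naming it is that the buffer size and the overflow check are derived from a single constant. A minimal sketch of that idea, assuming plain C strings: NAME_LENGTH_BOUND and copy_character_name are hypothetical names, and the value 200 is just the figure floated in this thread, not a limit taken from the Unicode Standard.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bound on the length of a character name, in bytes,
   not counting the terminating null.  */
enum { NAME_LENGTH_BOUND = 200 };

/* Hypothetical helper.  Copy SRC into DST if it fits within the
   bound; return false, leaving DST untouched, if it is too long.  */
static bool
copy_character_name (const char *src, char dst[NAME_LENGTH_BOUND + 1])
{
  size_t len = strlen (src);
  if (NAME_LENGTH_BOUND < len)
    return false;
  memcpy (dst, src, len + 1);	/* Include the terminating null.  */
  return true;
}
```

With the array declared as NAME_LENGTH_BOUND + 1 bytes, a name of exactly NAME_LENGTH_BOUND bytes is accepted and anything longer is rejected, so there is no off-by-one question about where the terminator goes.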
* Re: Character literals for Unicode (control) characters 2016-03-14 20:03 ` Paul Eggert 2016-03-14 20:30 ` Eli Zaretskii @ 2016-03-14 21:27 ` Clément Pit--Claudel 2016-03-14 21:48 ` Paul Eggert 2016-03-19 16:27 ` Philipp Stephani 2 siblings, 1 reply; 47+ messages in thread From: Clément Pit--Claudel @ 2016-03-14 21:27 UTC (permalink / raw) To: emacs-devel On 03/14/2016 04:03 PM, Paul Eggert wrote: > Active voice: "Checks" -> "Check". Do you mean "First person"? Both sound like active voice to me.
* Re: Character literals for Unicode (control) characters 2016-03-14 21:27 ` Clément Pit--Claudel @ 2016-03-14 21:48 ` Paul Eggert 0 siblings, 0 replies; 47+ messages in thread From: Paul Eggert @ 2016-03-14 21:48 UTC (permalink / raw) To: Clément Pit--Claudel, emacs-devel On 03/14/2016 02:27 PM, Clément Pit--Claudel wrote: >> Active voice: "Checks" -> "Check". > Do you mean "First person"? Both sound like active voice to me. > Yes, sorry, I meant to say "use the imperative form" actually. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Character literals for Unicode (control) characters 2016-03-14 20:03 ` Paul Eggert 2016-03-14 20:30 ` Eli Zaretskii 2016-03-14 21:27 ` Clément Pit--Claudel @ 2016-03-19 16:27 ` Philipp Stephani 2016-03-20 12:58 ` Paul Eggert 2 siblings, 1 reply; 47+ messages in thread From: Philipp Stephani @ 2016-03-19 16:27 UTC (permalink / raw) To: Paul Eggert, Eli Zaretskii; +Cc: larsi, johnw, emacs-devel Paul Eggert <eggert@cs.ucla.edu> wrote on Mon, 14 Mar 2016 at 21:03: > Thanks, here's a detailed low level review. > Thanks, all done. New patch is attached. [-- Attachment #2: 0001-Use-ucs-names-for-character-name-escapes.patch --] [-- Type: application/octet-stream, Size: 8343 bytes --] From 173eb8b38b4b495a46bca02779ae590130328175 Mon Sep 17 00:00:00 2001 From: Philipp Stephani <phst@google.com> Date: Sun, 13 Mar 2016 21:27:30 +0100 Subject: [PATCH] Use `ucs-names' for character name escapes * lread.c (invalid_character_name, check_scalar_value) (parse_code_after_prefix, character_name_to_code): New helper functions that use `ucs-names' and parsing for CJK ideographs. (read_escape): Use helper functions. (syms_of_lread): New symbol `ucs-names'. * test/src/lread-tests.el: New tests; fix a couple of bugs in existing tests. --- src/lread.c | 126 +++++++++++++++++++++++++++++++----------------- test/src/lread-tests.el | 11 ++++- 2 files changed, 91 insertions(+), 46 deletions(-) diff --git a/src/lread.c b/src/lread.c index 4000637..dc023eb 100644 --- a/src/lread.c +++ b/src/lread.c @@ -44,6 +44,7 @@ along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>. */ #include "termhooks.h" #include "blockinput.h" #include <c-ctype.h> +#include <string.h> #ifdef MSDOS #include "msdos.h" @@ -2151,36 +2152,90 @@ grow_read_buffer (void) MAX_MULTIBYTE_LENGTH, -1, 1); } -/* Hash table that maps Unicode character names to code points.
*/ -static Lisp_Object character_names; +/* Signal an invalid-read-syntax error indicating that the character + name in an \N{…} literal is invalid. */ +static _Noreturn void +invalid_character_name (Lisp_Object name) +{ + xsignal1 (Qinvalid_read_syntax, + CALLN (Fformat, build_pure_c_string ("\\N{%s}"), name)); +} -/* Length of the longest Unicode character name, in bytes. */ -static ptrdiff_t max_character_name_length; +/* Check that CODE is a valid Unicode scalar value, and return its + value. CODE should be parsed from the character name given by + NAME. NAME is used for error messages. */ +static int +check_scalar_value (Lisp_Object code, Lisp_Object name) +{ + if (! NUMBERP (code)) + invalid_character_name (name); + EMACS_INT i = XINT (code); + if (! (0 <= i && i <= MAX_UNICODE_CHAR) + /* Don't allow surrogates. */ + || (0xD800 <= code && code <= 0xDFFF)) + invalid_character_name (name); + return i; +} -/* Initializes `character_names' and `max_character_name_length'. - Called by `read_escape'. */ -void init_character_names (void) +/* If NAME starts with PREFIX, interpret the rest as a hexadecimal + number and return its value. Raise invalid-read-syntax if the + number is not a valid scalar value. Return −1 if NAME doesn’t + start with PREFIX. */ +static int +parse_code_after_prefix (Lisp_Object name, const char *prefix) { - character_names = CALLN (Fmake_hash_table, - QCtest, Qequal, - /* Currently around 100,000 Unicode - characters are defined. */ - QCsize, make_natnum (100000)); - Lisp_Object get_property = - Fsymbol_function (intern_c_string ("get-char-code-property")); - ptrdiff_t length = 0; - for (int i = 0; i <= MAX_UNICODE_CHAR; ++i) + ptrdiff_t name_len = SBYTES (name); + ptrdiff_t prefix_len = strlen (prefix); + /* Allow between one and eight hexadecimal digits after the + prefix. 
*/ + if (prefix_len < name_len && name_len <= prefix_len + 8 + && memcmp (SDATA (name), prefix, prefix_len) == 0) { - Lisp_Object code = make_natnum (i); - Lisp_Object name = call2 (get_property, code, Qname); - if (NILP (name)) continue; - CHECK_STRING (name); - length = max (length, SBYTES (name)); - Fputhash (name, code, character_names); + Lisp_Object code = string_to_number (SDATA (name) + prefix_len, 16, false); + if (NUMBERP (code)) + return check_scalar_value (code, name); } - max_character_name_length = length; + return -1; } +/* Returns the scalar value that has the Unicode character name NAME. + Raises `invalid-read-syntax' if there is no such character. */ +static int +character_name_to_code (Lisp_Object name) +{ + /* Code point as U+N, where N is between 1 and 8 hexadecimal + digits. */ + int code = parse_code_after_prefix (name, "U+"); + if (code >= 0) + return code; + + /* CJK ideographs are not contained in the association list returned + by `ucs-names'. But they follow a predictable naming pattern: a + fixed prefix plus the hexadecimal codepoint value. */ + code = parse_code_after_prefix (name, "CJK IDEOGRAPH-"); + if (code >= 0) + { + /* Various ranges of CJK characters; see UnicodeData.txt. */ + if ((0x3400 <= code && code <= 0x4DB5) + || (0x4E00 <= code && code <= 0x9FD5) + || (0x20000 <= code && code <= 0x2A6D6) + || (0x2A700 <= code && code <= 0x2B734) + || (0x2B740 <= code && code <= 0x2B81D) + || (0x2B820 <= code && code <= 0x2CEA1)) + return code; + else + invalid_character_name (name); + } + + /* Look up the name in the table returned by `ucs-names'. */ + Lisp_Object names = call0 (Qucs_names); + return check_scalar_value (CDR (Fassoc (name, names)), name); +} + +/* Bound on the length of a Unicode character name. As of + Unicode 9.0.0 the maximum is 83, so this should be safe. */ +enum { UNICODE_CHARACTER_NAME_LENGTH_BOUND = 200 }; + /* Read a \-escape sequence, assuming we already read the `\'. 
If the escape sequence forces unibyte, return eight-bit char. */ @@ -2394,10 +2449,7 @@ read_escape (Lisp_Object readcharfun, bool stringp) c = READCHAR; if (c != '{') invalid_syntax ("Expected opening brace after \\N"); - if (NILP (character_names)) - init_character_names (); - USE_SAFE_ALLOCA; - char *name = SAFE_ALLOCA (max_character_name_length + 1); + char name[UNICODE_CHARACTER_NAME_LENGTH_BOUND + 1]; bool whitespace = false; ptrdiff_t length = 0; while (true) @@ -2426,25 +2478,12 @@ read_escape (Lisp_Object readcharfun, bool stringp) else whitespace = false; name[length++] = c; - if (length >= max_character_name_length) + if (length >= sizeof name) invalid_syntax ("Character name too long"); } if (length == 0) invalid_syntax ("Empty character name"); - name[length] = 0; - Lisp_Object lisp_name = make_unibyte_string (name, length); - Lisp_Object code = - (length >= 3 && length <= 10 && name[0] == 'U' && name[1] == '+') ? - /* Code point as U+N, where N is between 1 and 8 hexadecimal - digits. */ - string_to_number (name + 2, 16, false) : - Fgethash (lisp_name, character_names, Qnil); - SAFE_FREE (); - if (! RANGED_INTEGERP (0, code, MAX_UNICODE_CHAR)) - xsignal1 (Qinvalid_read_syntax, - CALLN (Fformat, - build_pure_c_string ("\\N{%s}"), lisp_name)); - return XINT (code); + return character_name_to_code (make_unibyte_string (name, length)); } default: @@ -4836,6 +4875,5 @@ that are loaded before your customizations are read! 
*/); DEFSYM (Qrehash_size, "rehash-size"); DEFSYM (Qrehash_threshold, "rehash-threshold"); - character_names = Qnil; - staticpro (&character_names); + DEFSYM (Qucs_names, "ucs-names"); } diff --git a/test/src/lread-tests.el b/test/src/lread-tests.el index 1f87334..ff5d0f6 100644 --- a/test/src/lread-tests.el +++ b/test/src/lread-tests.el @@ -40,10 +40,17 @@ (should-error (read "?\\N{DOES NOT EXIST}")) :type 'invalid-read-syntax) (ert-deftest lread-char-non-ascii-name () - (should-error (read "?\\N{LATIN CAPITAL LETTER Ø}")) 'invalid-read-syntax) + (should-error (read "?\\N{LATIN CAPITAL LETTER Ø}") + :type 'invalid-read-syntax)) (ert-deftest lread-char-empty-name () - (should-error (read "?\\N{}")) 'invalid-read-syntax) + (should-error (read "?\\N{}") :type 'invalid-read-syntax)) + +(ert-deftest lread-char-cjk-name () + (should (equal ?\N{CJK IDEOGRAPH-2B734} #x2B734))) + +(ert-deftest lread-char-invalid-cjk-name () + (should-error (read "?\\N{CJK IDEOGRAPH-2B735}") :type 'invalid-read-syntax)) (ert-deftest lread-string-char-number () (should (equal "a\N{U+A817}b" "a\uA817b"))) -- 2.7.0 ^ permalink raw reply related [flat|nested] 47+ messages in thread
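A note on the check_scalar_value helper in the patch above: its surrogate test appears to compare the Lisp_Object `code` rather than the extracted integer `i`, which would not test the numeric value on builds where Lisp_Object is a tagged integer. The following is a minimal standalone sketch of the intended check, written with plain C integers so it can be compiled outside Emacs. The names `scalar_value_from_integer` and `MAX_UNICODE` are hypothetical stand-ins, not Emacs identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Largest Unicode code point; a stand-in for Emacs's MAX_UNICODE_CHAR.  */
enum { MAX_UNICODE = 0x10FFFF };

/* If I is a Unicode scalar value, store it in *OUT and return true.
   Otherwise return false and leave *OUT untouched.  A scalar value is
   a code point in [0, 0x10FFFF] outside the surrogate range
   U+D800..U+DFFF.  (Hypothetical helper, not part of lread.c.)  */
bool
scalar_value_from_integer (long long i, int *out)
{
  assert (out != NULL);
  if (! (0 <= i && i <= MAX_UNICODE))
    return false;
  /* Test the plain integer I here, not a boxed object.  */
  if (0xD800 <= i && i <= 0xDFFF)
    return false;
  *out = (int) i;
  return true;
}
```

In the reader, a false return would correspond to signaling invalid-read-syntax for the offending \N{...} escape.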
* Re: Character literals for Unicode (control) characters 2016-03-19 16:27 ` Philipp Stephani @ 2016-03-20 12:58 ` Paul Eggert 2016-03-20 13:25 ` Philipp Stephani 0 siblings, 1 reply; 47+ messages in thread From: Paul Eggert @ 2016-03-20 12:58 UTC (permalink / raw) To: Philipp Stephani, Eli Zaretskii; +Cc: larsi, johnw, emacs-devel Thanks, one thing I didn't notice earlier: + xsignal1 (Qinvalid_read_syntax, + CALLN (Fformat, build_pure_c_string ("\\N{%s}"), name)); This can run Emacs out of pure space unnecessarily. Use AUTO_STRING instead of build_pure_c_string. Also, I've lost track of what this patch is building on. Perhaps send all the patches next time.... ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Character literals for Unicode (control) characters 2016-03-20 12:58 ` Paul Eggert @ 2016-03-20 13:25 ` Philipp Stephani 2016-03-25 17:41 ` Philipp Stephani 0 siblings, 1 reply; 47+ messages in thread From: Philipp Stephani @ 2016-03-20 13:25 UTC (permalink / raw) To: Paul Eggert, Eli Zaretskii; +Cc: larsi, johnw, emacs-devel [-- Attachment #1.1: Type: text/plain, Size: 500 bytes --] Paul Eggert <eggert@cs.ucla.edu> wrote on Sun, 20 Mar 2016 at 13:58: > Thanks, one thing I didn't notice earlier: > > + xsignal1 (Qinvalid_read_syntax, > + CALLN (Fformat, build_pure_c_string ("\\N{%s}"), name)); > > This can run Emacs out of pure space unnecessarily. Use AUTO_STRING > instead of > build_pure_c_string. > > Also, I've lost track of what this patch is building on. Perhaps send all > the > patches next time.... > Done. Attached all patches. [-- Attachment #1.2: Type: text/html, Size: 807 bytes --] [-- Attachment #2: 0003-Minor-cleanups-for-character-name-escapes.patch --] [-- Type: application/octet-stream, Size: 2923 bytes --] From 30e6d9dd4e83a36fe07bbeae678b3f086773346e Mon Sep 17 00:00:00 2001 From: Philipp Stephani <phst@google.com> Date: Sun, 6 Mar 2016 19:27:21 +0100 Subject: [PATCH 3/4] Minor cleanups for character name escapes. * lread.c (init_character_names): Add missing `void'. Remove top-level `const'. (read_escape): Simplify loop a bit. Remove top-level `const'. --- src/lread.c | 27 ++++++++++++--------------- 1 file changed, 12 insertions(+), 15 deletions(-) diff --git a/src/lread.c b/src/lread.c index 6e84fc8..4000637 100644 --- a/src/lread.c +++ b/src/lread.c @@ -2159,20 +2159,20 @@ static ptrdiff_t max_character_name_length; /* Initializes `character_names' and `max_character_name_length'. Called by `read_escape'. */ -void init_character_names () +void init_character_names (void) { character_names = CALLN (Fmake_hash_table, QCtest, Qequal, /* Currently around 100,000 Unicode characters are defined. 
*/ QCsize, make_natnum (100000)); - const Lisp_Object get_property = + Lisp_Object get_property = Fsymbol_function (intern_c_string ("get-char-code-property")); ptrdiff_t length = 0; for (int i = 0; i <= MAX_UNICODE_CHAR; ++i) { - const Lisp_Object code = make_natnum (i); - const Lisp_Object name = call2 (get_property, code, Qname); + Lisp_Object code = make_natnum (i); + Lisp_Object name = call2 (get_property, code, Qname); if (NILP (name)) continue; CHECK_STRING (name); length = max (length, SBYTES (name)); @@ -2418,25 +2418,22 @@ read_escape (Lisp_Object readcharfun, bool stringp) character names in e.g. multi-line strings. */ if (c_isspace (c)) { - if (! whitespace) - { - whitespace = true; - name[length++] = ' '; - } + if (whitespace) + continue; + c = ' '; + whitespace = true; } else - { - whitespace = false; - name[length++] = c; - } + whitespace = false; + name[length++] = c; if (length >= max_character_name_length) invalid_syntax ("Character name too long"); } if (length == 0) invalid_syntax ("Empty character name"); name[length] = 0; - const Lisp_Object lisp_name = make_unibyte_string (name, length); - const Lisp_Object code = + Lisp_Object lisp_name = make_unibyte_string (name, length); + Lisp_Object code = (length >= 3 && length <= 10 && name[0] == 'U' && name[1] == '+') ? /* Code point as U+N, where N is between 1 and 8 hexadecimal digits. */ -- 2.7.0 [-- Attachment #3: 0001-Implement-named-character-escapes-similar-to-Perl.patch --] [-- Type: application/octet-stream, Size: 7134 bytes --] From 22e299cd23a72a072461befa30a04bf557aecac8 Mon Sep 17 00:00:00 2001 From: Philipp Stephani <phst@google.com> Date: Sun, 6 Mar 2016 16:16:29 +0100 Subject: [PATCH 1/4] Implement named character escapes, similar to Perl * lread.c (init_character_names): New function. (read_escape): Read Perl-style named character escape sequences. (syms_of_lread): Initialize new variable `character_names'. 
* test/src/lread-tests.el (lread-char-empty-name): Add test file for src/lread.c. --- src/lread.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++ test/src/lread-tests.el | 54 ++++++++++++++++++++++++++++ 2 files changed, 150 insertions(+) create mode 100644 test/src/lread-tests.el diff --git a/src/lread.c b/src/lread.c index 25e3ff0..6e84fc8 100644 --- a/src/lread.c +++ b/src/lread.c @@ -43,6 +43,7 @@ along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>. */ #include "systime.h" #include "termhooks.h" #include "blockinput.h" +#include <c-ctype.h> #ifdef MSDOS #include "msdos.h" @@ -2150,6 +2151,36 @@ grow_read_buffer (void) MAX_MULTIBYTE_LENGTH, -1, 1); } +/* Hash table that maps Unicode character names to code points. */ +static Lisp_Object character_names; + +/* Length of the longest Unicode character name, in bytes. */ +static ptrdiff_t max_character_name_length; + +/* Initializes `character_names' and `max_character_name_length'. + Called by `read_escape'. */ +void init_character_names () +{ + character_names = CALLN (Fmake_hash_table, + QCtest, Qequal, + /* Currently around 100,000 Unicode + characters are defined. */ + QCsize, make_natnum (100000)); + const Lisp_Object get_property = + Fsymbol_function (intern_c_string ("get-char-code-property")); + ptrdiff_t length = 0; + for (int i = 0; i <= MAX_UNICODE_CHAR; ++i) + { + const Lisp_Object code = make_natnum (i); + const Lisp_Object name = call2 (get_property, code, Qname); + if (NILP (name)) continue; + CHECK_STRING (name); + length = max (length, SBYTES (name)); + Fputhash (name, code, character_names); + } + max_character_name_length = length; +} + /* Read a \-escape sequence, assuming we already read the `\'. If the escape sequence forces unibyte, return eight-bit char. */ @@ -2357,6 +2388,68 @@ read_escape (Lisp_Object readcharfun, bool stringp) return i; } + case 'N': + /* Named character. 
*/ + { + c = READCHAR; + if (c != '{') + invalid_syntax ("Expected opening brace after \\N"); + if (NILP (character_names)) + init_character_names (); + USE_SAFE_ALLOCA; + char *name = SAFE_ALLOCA (max_character_name_length + 1); + bool whitespace = false; + ptrdiff_t length = 0; + while (true) + { + c = READCHAR; + if (c < 0) + end_of_file_error (); + if (c == '}') + break; + if (! c_isascii (c)) + xsignal1 (Qinvalid_read_syntax, + CALLN (Fformat, + build_pure_c_string ("Non-ASCII character U+%04X" + " in character name"), + make_natnum (c))); + /* We treat multiple adjacent whitespace characters as a + single space character. This makes it easier to use + character names in e.g. multi-line strings. */ + if (c_isspace (c)) + { + if (! whitespace) + { + whitespace = true; + name[length++] = ' '; + } + } + else + { + whitespace = false; + name[length++] = c; + } + if (length >= max_character_name_length) + invalid_syntax ("Character name too long"); + } + if (length == 0) + invalid_syntax ("Empty character name"); + name[length] = 0; + const Lisp_Object lisp_name = make_unibyte_string (name, length); + const Lisp_Object code = + (length >= 3 && length <= 10 && name[0] == 'U' && name[1] == '+') ? + /* Code point as U+N, where N is between 1 and 8 hexadecimal + digits. */ + string_to_number (name + 2, 16, false) : + Fgethash (lisp_name, character_names, Qnil); + SAFE_FREE (); + if (! RANGED_INTEGERP (0, code, MAX_UNICODE_CHAR)) + xsignal1 (Qinvalid_read_syntax, + CALLN (Fformat, + build_pure_c_string ("\\N{%s}"), lisp_name)); + return XINT (code); + } + default: return c; } @@ -4745,4 +4838,7 @@ that are loaded before your customizations are read! 
*/); DEFSYM (Qweakness, "weakness"); DEFSYM (Qrehash_size, "rehash-size"); DEFSYM (Qrehash_threshold, "rehash-threshold"); + + character_names = Qnil; + staticpro (&character_names); } diff --git a/test/src/lread-tests.el b/test/src/lread-tests.el new file mode 100644 index 0000000..1f87334 --- /dev/null +++ b/test/src/lread-tests.el @@ -0,0 +1,54 @@ +;;; lread-tests.el --- tests for lread.c -*- lexical-binding: t; -*- + +;; Copyright (C) 2016 Google Inc. + +;; Author: Philipp Stephani <phst@google.com> + +;; This file is part of GNU Emacs. + +;; This program is free software; you can redistribute it and/or modify +;; it under the terms of the GNU General Public License as published by +;; the Free Software Foundation, either version 3 of the License, or +;; (at your option) any later version. + +;; This program is distributed in the hope that it will be useful, +;; but WITHOUT ANY WARRANTY; without even the implied warranty of +;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +;; GNU General Public License for more details. + +;; You should have received a copy of the GNU General Public License +;; along with this program. If not, see <http://www.gnu.org/licenses/>. + +;;; Commentary: + +;; Unit tests for code in src/lread.c. 
+ +;;; Code: + +(ert-deftest lread-char-number () + (should (equal ?\N{U+A817} #xA817))) + +(ert-deftest lread-char-name () + (should (equal ?\N{SYLOTI NAGRI LETTER + DHO} + #xA817))) + +(ert-deftest lread-char-invalid-number () + (should-error (read "?\\N{U+110000}") :type 'invalid-read-syntax)) + +(ert-deftest lread-char-invalid-name () + (should-error (read "?\\N{DOES NOT EXIST}")) :type 'invalid-read-syntax) + +(ert-deftest lread-char-non-ascii-name () + (should-error (read "?\\N{LATIN CAPITAL LETTER Ø}")) 'invalid-read-syntax) + +(ert-deftest lread-char-empty-name () + (should-error (read "?\\N{}")) 'invalid-read-syntax) + +(ert-deftest lread-string-char-number () + (should (equal "a\N{U+A817}b" "a\uA817b"))) + +(ert-deftest lread-string-char-name () + (should (equal "a\N{SYLOTI NAGRI LETTER DHO}b" "a\uA817b"))) + +;;; lread-tests.el ends here -- 2.7.0 [-- Attachment #4: 0002-Add-documentation-for-character-name-escapes.patch --] [-- Type: application/octet-stream, Size: 2348 bytes --] From d0d5219a358a2d8e853f1ce11cf16fb2629697c6 Mon Sep 17 00:00:00 2001 From: Philipp Stephani <phst@google.com> Date: Sun, 6 Mar 2016 19:07:06 +0100 Subject: [PATCH 2/4] Add documentation for character name escapes --- doc/lispref/nonascii.texi | 2 +- doc/lispref/objects.texi | 10 ++++++++++ etc/NEWS | 5 +++++ 3 files changed, 16 insertions(+), 1 deletion(-) diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi index 9cf3b57..66ad9ac 100644 --- a/doc/lispref/nonascii.texi +++ b/doc/lispref/nonascii.texi @@ -633,7 +633,7 @@ Character Properties @end group @group ;; U+2163 ROMAN NUMERAL FOUR -(get-char-code-property ?\u2163 'numeric-value) +(get-char-code-property ?\N@{ROMAN NUMERAL FOUR@} 'numeric-value) @result{} 4 @end group @group diff --git a/doc/lispref/objects.texi b/doc/lispref/objects.texi index 3245930..96b334d 100644 --- a/doc/lispref/objects.texi +++ b/doc/lispref/objects.texi @@ -387,6 +387,16 @@ General Escape Syntax for the character @kbd{C-b}. 
Only characters up to octal code 777 can be specified this way. + Fourthly, you can specify characters by their name. A character +name escape sequence consists of a backslash, @samp{N@{}, the Unicode +character name, and @samp{@}}. Alternatively, you can also put the +numeric code point value between the braces, using the syntax +@samp{\N@{U+nnnn@}}, where @samp{nnnn} denotes between one and eight +hexadecimal digits. Thus, @samp{?\N@{LATIN CAPITAL LETTER A@}} and +@samp{?\N@{U+41@}} both denote the character @kbd{A}. To simplify +entering multi-line strings, you can replace spaces in the character +names by arbitrary non-empty sequence of whitespace (e.g., newlines). + These escape sequences may also be used in strings. @xref{Non-ASCII in Strings}. diff --git a/etc/NEWS b/etc/NEWS index 92d69d2..9c77474 100644 --- a/etc/NEWS +++ b/etc/NEWS @@ -159,6 +159,11 @@ that negotiation should complete even on non-blocking sockets. `window-pixel-height-before-size-change' allow to detect which window changed size when `window-size-change-functions' are run. ++++ +** Emacs now supports character name escape sequences in character and +string literals. The syntax variants \N{character name} and +\N{U+code} are supported. + \f * Changes in Emacs 25.2 on Non-Free Operating Systems -- 2.7.0 [-- Attachment #5: 0004-Use-ucs-names-for-character-name-escapes.patch --] [-- Type: application/octet-stream, Size: 8347 bytes --] From 173eb8b38b4b495a46bca02779ae590130328175 Mon Sep 17 00:00:00 2001 From: Philipp Stephani <phst@google.com> Date: Sun, 13 Mar 2016 21:27:30 +0100 Subject: [PATCH 4/4] Use `ucs-names' for character name escapes * lread.c (invalid_character_name, check_scalar_value) (parse_code_after_prefix, character_name_to_code): New helper functions that use `ucs-names' and parsing for CJK ideographs. (read_escape): Use helper functions. (syms_of_lread): New symbol `ucs-names'. * test/src/lread-tests.el: New tests; fix a couple of bugs in existing tests. 
--- src/lread.c | 126 +++++++++++++++++++++++++++++++----------------- test/src/lread-tests.el | 11 ++++- 2 files changed, 91 insertions(+), 46 deletions(-) diff --git a/src/lread.c b/src/lread.c index 4000637..dc023eb 100644 --- a/src/lread.c +++ b/src/lread.c @@ -44,6 +44,7 @@ along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>. */ #include "termhooks.h" #include "blockinput.h" #include <c-ctype.h> +#include <string.h> #ifdef MSDOS #include "msdos.h" @@ -2151,36 +2152,90 @@ grow_read_buffer (void) MAX_MULTIBYTE_LENGTH, -1, 1); } -/* Hash table that maps Unicode character names to code points. */ -static Lisp_Object character_names; +/* Signal an invalid-read-syntax error indicating that the character + name in an \N{…} literal is invalid. */ +static _Noreturn void +invalid_character_name (Lisp_Object name) +{ + xsignal1 (Qinvalid_read_syntax, + CALLN (Fformat, build_pure_c_string ("\\N{%s}"), name)); +} -/* Length of the longest Unicode character name, in bytes. */ -static ptrdiff_t max_character_name_length; +/* Check that CODE is a valid Unicode scalar value, and return its + value. CODE should be parsed from the character name given by + NAME. NAME is used for error messages. */ +static int +check_scalar_value (Lisp_Object code, Lisp_Object name) +{ + if (! NUMBERP (code)) + invalid_character_name (name); + EMACS_INT i = XINT (code); + if (! (0 <= i && i <= MAX_UNICODE_CHAR) + /* Don't allow surrogates. */ + || (0xD800 <= code && code <= 0xDFFF)) + invalid_character_name (name); + return i; +} -/* Initializes `character_names' and `max_character_name_length'. - Called by `read_escape'. */ -void init_character_names (void) +/* If NAME starts with PREFIX, interpret the rest as a hexadecimal + number and return its value. Raise invalid-read-syntax if the + number is not a valid scalar value. Return −1 if NAME doesn’t + start with PREFIX. 
*/ +static int +parse_code_after_prefix (Lisp_Object name, const char *prefix) { - character_names = CALLN (Fmake_hash_table, - QCtest, Qequal, - /* Currently around 100,000 Unicode - characters are defined. */ - QCsize, make_natnum (100000)); - Lisp_Object get_property = - Fsymbol_function (intern_c_string ("get-char-code-property")); - ptrdiff_t length = 0; - for (int i = 0; i <= MAX_UNICODE_CHAR; ++i) + ptrdiff_t name_len = SBYTES (name); + ptrdiff_t prefix_len = strlen (prefix); + /* Allow between one and eight hexadecimal digits after the + prefix. */ + if (prefix_len < name_len && name_len <= prefix_len + 8 + && memcmp (SDATA (name), prefix, prefix_len) == 0) { - Lisp_Object code = make_natnum (i); - Lisp_Object name = call2 (get_property, code, Qname); - if (NILP (name)) continue; - CHECK_STRING (name); - length = max (length, SBYTES (name)); - Fputhash (name, code, character_names); + Lisp_Object code = string_to_number (SDATA (name) + prefix_len, 16, false); + if (NUMBERP (code)) + return check_scalar_value (code, name); } - max_character_name_length = length; + return -1; } +/* Returns the scalar value that has the Unicode character name NAME. + Raises `invalid-read-syntax' if there is no such character. */ +static int +character_name_to_code (Lisp_Object name) +{ + /* Code point as U+N, where N is between 1 and 8 hexadecimal + digits. */ + int code = parse_code_after_prefix (name, "U+"); + if (code >= 0) + return code; + + /* CJK ideographs are not contained in the association list returned + by `ucs-names'. But they follow a predictable naming pattern: a + fixed prefix plus the hexadecimal codepoint value. */ + code = parse_code_after_prefix (name, "CJK IDEOGRAPH-"); + if (code >= 0) + { + /* Various ranges of CJK characters; see UnicodeData.txt. 
*/ + if ((0x3400 <= code && code <= 0x4DB5) + || (0x4E00 <= code && code <= 0x9FD5) + || (0x20000 <= code && code <= 0x2A6D6) + || (0x2A700 <= code && code <= 0x2B734) + || (0x2B740 <= code && code <= 0x2B81D) + || (0x2B820 <= code && code <= 0x2CEA1)) + return code; + else + invalid_character_name (name); + } + + /* Look up the name in the table returned by `ucs-names'. */ + Lisp_Object names = call0 (Qucs_names); + return check_scalar_value (CDR (Fassoc (name, names)), name); +} + +/* Bound on the length of a Unicode character name. As of + Unicode 9.0.0 the maximum is 83, so this should be safe. */ +enum { UNICODE_CHARACTER_NAME_LENGTH_BOUND = 200 }; + /* Read a \-escape sequence, assuming we already read the `\'. If the escape sequence forces unibyte, return eight-bit char. */ @@ -2394,10 +2449,7 @@ read_escape (Lisp_Object readcharfun, bool stringp) c = READCHAR; if (c != '{') invalid_syntax ("Expected opening brace after \\N"); - if (NILP (character_names)) - init_character_names (); - USE_SAFE_ALLOCA; - char *name = SAFE_ALLOCA (max_character_name_length + 1); + char name[UNICODE_CHARACTER_NAME_LENGTH_BOUND + 1]; bool whitespace = false; ptrdiff_t length = 0; while (true) @@ -2426,25 +2478,12 @@ read_escape (Lisp_Object readcharfun, bool stringp) else whitespace = false; name[length++] = c; - if (length >= max_character_name_length) + if (length >= sizeof name) invalid_syntax ("Character name too long"); } if (length == 0) invalid_syntax ("Empty character name"); - name[length] = 0; - Lisp_Object lisp_name = make_unibyte_string (name, length); - Lisp_Object code = - (length >= 3 && length <= 10 && name[0] == 'U' && name[1] == '+') ? - /* Code point as U+N, where N is between 1 and 8 hexadecimal - digits. */ - string_to_number (name + 2, 16, false) : - Fgethash (lisp_name, character_names, Qnil); - SAFE_FREE (); - if (! 
RANGED_INTEGERP (0, code, MAX_UNICODE_CHAR)) - xsignal1 (Qinvalid_read_syntax, - CALLN (Fformat, - build_pure_c_string ("\\N{%s}"), lisp_name)); - return XINT (code); + return character_name_to_code (make_unibyte_string (name, length)); } default: @@ -4836,6 +4875,5 @@ that are loaded before your customizations are read! */); DEFSYM (Qrehash_size, "rehash-size"); DEFSYM (Qrehash_threshold, "rehash-threshold"); - character_names = Qnil; - staticpro (&character_names); + DEFSYM (Qucs_names, "ucs-names"); } diff --git a/test/src/lread-tests.el b/test/src/lread-tests.el index 1f87334..ff5d0f6 100644 --- a/test/src/lread-tests.el +++ b/test/src/lread-tests.el @@ -40,10 +40,17 @@ (should-error (read "?\\N{DOES NOT EXIST}")) :type 'invalid-read-syntax) (ert-deftest lread-char-non-ascii-name () - (should-error (read "?\\N{LATIN CAPITAL LETTER Ø}")) 'invalid-read-syntax) + (should-error (read "?\\N{LATIN CAPITAL LETTER Ø}") + :type 'invalid-read-syntax)) (ert-deftest lread-char-empty-name () - (should-error (read "?\\N{}")) 'invalid-read-syntax) + (should-error (read "?\\N{}") :type 'invalid-read-syntax)) + +(ert-deftest lread-char-cjk-name () + (should (equal ?\N{CJK IDEOGRAPH-2B734} #x2B734))) + +(ert-deftest lread-char-invalid-cjk-name () + (should-error (read "?\\N{CJK IDEOGRAPH-2B735}") :type 'invalid-read-syntax)) (ert-deftest lread-string-char-number () (should (equal "a\N{U+A817}b" "a\uA817b"))) -- 2.7.0 ^ permalink raw reply related [flat|nested] 47+ messages in thread
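The whitespace handling that read_escape applies to a character name (simplified in patch 3/4 above) can be tried outside Emacs. The following is a sketch only: it reads from a C string instead of READCHAR, returns -1 where the reader would signal an error, and checks the length bound before storing a byte rather than after. `collapse_name`, `ascii_space_p` and `NAME_LENGTH_BOUND` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for UNICODE_CHARACTER_NAME_LENGTH_BOUND in the patch.  */
enum { NAME_LENGTH_BOUND = 200 };

/* Return nonzero if C is ASCII whitespace.  */
static int
ascii_space_p (int c)
{
  return (c == ' ' || c == '\t' || c == '\n'
          || c == '\v' || c == '\f' || c == '\r');
}

/* Copy the character name that starts at IN and ends before the first
   '}' into OUT, replacing each run of whitespace with one space.
   OUT must have room for NAME_LENGTH_BOUND + 1 bytes.  Return the
   length of the name, or -1 if the name is empty, unterminated,
   contains a non-ASCII byte, or exceeds NAME_LENGTH_BOUND bytes.  */
ptrdiff_t
collapse_name (const char *in, char *out)
{
  assert (in != NULL && out != NULL);
  int whitespace = 0;
  ptrdiff_t length = 0;
  for (;; in++)
    {
      unsigned char c = (unsigned char) *in;
      if (c == '\0')
        return -1;              /* No closing brace.  */
      if (c == '}')
        break;
      if (c > 127)
        return -1;              /* Non-ASCII byte in the name.  */
      if (ascii_space_p (c))
        {
          if (whitespace)
            continue;           /* Skip the rest of a whitespace run.  */
          c = ' ';
          whitespace = 1;
        }
      else
        whitespace = 0;
      if (length >= NAME_LENGTH_BOUND)
        return -1;              /* Name too long.  */
      out[length++] = (char) c;
    }
  if (length == 0)
    return -1;                  /* Empty name.  */
  out[length] = '\0';
  return length;
}
```

With this, a name split across lines, as in the lread-char-name test, collapses to the single-space form before any lookup is attempted.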
* Re: Character literals for Unicode (control) characters 2016-03-20 13:25 ` Philipp Stephani @ 2016-03-25 17:41 ` Philipp Stephani 2016-04-22 2:39 ` Paul Eggert 0 siblings, 1 reply; 47+ messages in thread From: Philipp Stephani @ 2016-03-25 17:41 UTC (permalink / raw) To: Paul Eggert, Eli Zaretskii; +Cc: larsi, johnw, emacs-devel [-- Attachment #1.1: Type: text/plain, Size: 685 bytes --] Philipp Stephani <p.stephani2@gmail.com> wrote on Sun, 20 Mar 2016 at 14:25: > Paul Eggert <eggert@cs.ucla.edu> wrote on Sun, 20 Mar 2016 at > 13:58: > >> Thanks, one thing I didn't notice earlier: >> >> + xsignal1 (Qinvalid_read_syntax, >> + CALLN (Fformat, build_pure_c_string ("\\N{%s}"), name)); >> >> This can run Emacs out of pure space unnecessarily. Use AUTO_STRING >> instead of >> build_pure_c_string. >> >> Also, I've lost track of what this patch is building on. Perhaps send all >> the >> patches next time.... >> > > Done. Attached all patches. > Oops, forgot to actually commit the changes. New patch attached. [-- Attachment #1.2: Type: text/html, Size: 1309 bytes --] [-- Attachment #2: 0001-Use-ucs-names-for-character-name-escapes.patch --] [-- Type: application/octet-stream, Size: 9269 bytes --] From 808f28cde583e2aa05dffff65b40c684d7895eab Mon Sep 17 00:00:00 2001 From: Philipp Stephani <phst@google.com> Date: Sun, 13 Mar 2016 21:27:30 +0100 Subject: [PATCH] Use `ucs-names' for character name escapes * lread.c (invalid_character_name, check_scalar_value) (parse_code_after_prefix, character_name_to_code): New helper functions that use `ucs-names' and parsing for CJK ideographs. (read_escape): Use helper functions. (syms_of_lread): New symbol `ucs-names'. * test/src/lread-tests.el: New tests; fix a couple of bugs in existing tests. 
--- src/lread.c | 137 +++++++++++++++++++++++++++++++----------------- test/src/lread-tests.el | 11 +++- 2 files changed, 97 insertions(+), 51 deletions(-) diff --git a/src/lread.c b/src/lread.c index 4000637..fd5b363 100644 --- a/src/lread.c +++ b/src/lread.c @@ -44,6 +44,7 @@ along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>. */ #include "termhooks.h" #include "blockinput.h" #include <c-ctype.h> +#include <string.h> #ifdef MSDOS #include "msdos.h" @@ -2151,36 +2152,90 @@ grow_read_buffer (void) MAX_MULTIBYTE_LENGTH, -1, 1); } -/* Hash table that maps Unicode character names to code points. */ -static Lisp_Object character_names; +/* Signal an invalid-read-syntax error indicating that the character + name in an \N{…} literal is invalid. */ +static _Noreturn void +invalid_character_name (Lisp_Object name) +{ + AUTO_STRING (format, "\\N{%s}"); + xsignal1 (Qinvalid_read_syntax, CALLN (Fformat, format, name)); +} -/* Length of the longest Unicode character name, in bytes. */ -static ptrdiff_t max_character_name_length; +/* Check that CODE is a valid Unicode scalar value, and return its + value. CODE should be parsed from the character name given by + NAME. NAME is used for error messages. */ +static int +check_scalar_value (Lisp_Object code, Lisp_Object name) +{ + if (! NUMBERP (code)) + invalid_character_name (name); + EMACS_INT i = XINT (code); + if (! (0 <= i && i <= MAX_UNICODE_CHAR) + /* Don't allow surrogates. */ + || (0xD800 <= code && code <= 0xDFFF)) + invalid_character_name (name); + return i; +} -/* Initializes `character_names' and `max_character_name_length'. - Called by `read_escape'. */ -void init_character_names (void) +/* If NAME starts with PREFIX, interpret the rest as a hexadecimal + number and return its value. Raise invalid-read-syntax if the + number is not a valid scalar value. Return −1 if NAME doesn’t + start with PREFIX. 
*/ +static int +parse_code_after_prefix (Lisp_Object name, const char *prefix) { - character_names = CALLN (Fmake_hash_table, - QCtest, Qequal, - /* Currently around 100,000 Unicode - characters are defined. */ - QCsize, make_natnum (100000)); - Lisp_Object get_property = - Fsymbol_function (intern_c_string ("get-char-code-property")); - ptrdiff_t length = 0; - for (int i = 0; i <= MAX_UNICODE_CHAR; ++i) + ptrdiff_t name_len = SBYTES (name); + ptrdiff_t prefix_len = strlen (prefix); + /* Allow between one and eight hexadecimal digits after the + prefix. */ + if (prefix_len < name_len && name_len <= prefix_len + 8 + && memcmp (SDATA (name), prefix, prefix_len) == 0) { - Lisp_Object code = make_natnum (i); - Lisp_Object name = call2 (get_property, code, Qname); - if (NILP (name)) continue; - CHECK_STRING (name); - length = max (length, SBYTES (name)); - Fputhash (name, code, character_names); + Lisp_Object code = string_to_number (SDATA (name) + prefix_len, 16, false); + if (NUMBERP (code)) + return check_scalar_value (code, name); + } + return -1; +} + +/* Returns the scalar value that has the Unicode character name NAME. + Raises `invalid-read-syntax' if there is no such character. */ +static int +character_name_to_code (Lisp_Object name) +{ + /* Code point as U+N, where N is between 1 and 8 hexadecimal + digits. */ + int code = parse_code_after_prefix (name, "U+"); + if (code >= 0) + return code; + + /* CJK ideographs are not contained in the association list returned + by `ucs-names'. But they follow a predictable naming pattern: a + fixed prefix plus the hexadecimal codepoint value. */ + code = parse_code_after_prefix (name, "CJK IDEOGRAPH-"); + if (code >= 0) + { + /* Various ranges of CJK characters; see UnicodeData.txt. 
*/ + if ((0x3400 <= code && code <= 0x4DB5) + || (0x4E00 <= code && code <= 0x9FD5) + || (0x20000 <= code && code <= 0x2A6D6) + || (0x2A700 <= code && code <= 0x2B734) + || (0x2B740 <= code && code <= 0x2B81D) + || (0x2B820 <= code && code <= 0x2CEA1)) + return code; + else + invalid_character_name (name); } - max_character_name_length = length; + + /* Look up the name in the table returned by `ucs-names'. */ + Lisp_Object names = call0 (Qucs_names); + return check_scalar_value (CDR (Fassoc (name, names)), name); } +/* Bound on the length of a Unicode character name. As of + Unicode 9.0.0 the maximum is 83, so this should be safe. */ +enum { UNICODE_CHARACTER_NAME_LENGTH_BOUND = 200 }; + /* Read a \-escape sequence, assuming we already read the `\'. If the escape sequence forces unibyte, return eight-bit char. */ @@ -2394,10 +2449,7 @@ read_escape (Lisp_Object readcharfun, bool stringp) c = READCHAR; if (c != '{') invalid_syntax ("Expected opening brace after \\N"); - if (NILP (character_names)) - init_character_names (); - USE_SAFE_ALLOCA; - char *name = SAFE_ALLOCA (max_character_name_length + 1); + char name[UNICODE_CHARACTER_NAME_LENGTH_BOUND + 1]; bool whitespace = false; ptrdiff_t length = 0; while (true) @@ -2408,11 +2460,12 @@ read_escape (Lisp_Object readcharfun, bool stringp) if (c == '}') break; if (! c_isascii (c)) - xsignal1 (Qinvalid_read_syntax, - CALLN (Fformat, - build_pure_c_string ("Non-ASCII character U+%04X" - " in character name"), - make_natnum (c))); + { + AUTO_STRING (format, + "Non-ASCII character U+%04X in character name"); + xsignal1 (Qinvalid_read_syntax, + CALLN (Fformat, format, make_natnum (c))); + } /* We treat multiple adjacent whitespace characters as a single space character. This makes it easier to use character names in e.g. multi-line strings. 
*/ @@ -2426,25 +2479,12 @@ read_escape (Lisp_Object readcharfun, bool stringp) else whitespace = false; name[length++] = c; - if (length >= max_character_name_length) + if (length >= sizeof name) invalid_syntax ("Character name too long"); } if (length == 0) invalid_syntax ("Empty character name"); - name[length] = 0; - Lisp_Object lisp_name = make_unibyte_string (name, length); - Lisp_Object code = - (length >= 3 && length <= 10 && name[0] == 'U' && name[1] == '+') ? - /* Code point as U+N, where N is between 1 and 8 hexadecimal - digits. */ - string_to_number (name + 2, 16, false) : - Fgethash (lisp_name, character_names, Qnil); - SAFE_FREE (); - if (! RANGED_INTEGERP (0, code, MAX_UNICODE_CHAR)) - xsignal1 (Qinvalid_read_syntax, - CALLN (Fformat, - build_pure_c_string ("\\N{%s}"), lisp_name)); - return XINT (code); + return character_name_to_code (make_unibyte_string (name, length)); } default: @@ -4836,6 +4876,5 @@ that are loaded before your customizations are read! */); DEFSYM (Qrehash_size, "rehash-size"); DEFSYM (Qrehash_threshold, "rehash-threshold"); - character_names = Qnil; - staticpro (&character_names); + DEFSYM (Qucs_names, "ucs-names"); } diff --git a/test/src/lread-tests.el b/test/src/lread-tests.el index 1f87334..ff5d0f6 100644 --- a/test/src/lread-tests.el +++ b/test/src/lread-tests.el @@ -40,10 +40,17 @@ (should-error (read "?\\N{DOES NOT EXIST}")) :type 'invalid-read-syntax) (ert-deftest lread-char-non-ascii-name () - (should-error (read "?\\N{LATIN CAPITAL LETTER Ø}")) 'invalid-read-syntax) + (should-error (read "?\\N{LATIN CAPITAL LETTER Ø}") + :type 'invalid-read-syntax)) (ert-deftest lread-char-empty-name () - (should-error (read "?\\N{}")) 'invalid-read-syntax) + (should-error (read "?\\N{}") :type 'invalid-read-syntax)) + +(ert-deftest lread-char-cjk-name () + (should (equal ?\N{CJK IDEOGRAPH-2B734} #x2B734))) + +(ert-deftest lread-char-invalid-cjk-name () + (should-error (read "?\\N{CJK IDEOGRAPH-2B735}") :type 'invalid-read-syntax)) 
(ert-deftest lread-string-char-number () (should (equal "a\N{U+A817}b" "a\uA817b"))) -- 2.7.0 ^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: Character literals for Unicode (control) characters 2016-03-25 17:41 ` Philipp Stephani @ 2016-04-22 2:39 ` Paul Eggert 2016-04-22 7:57 ` Eli Zaretskii 0 siblings, 1 reply; 47+ messages in thread From: Paul Eggert @ 2016-04-22 2:39 UTC (permalink / raw) To: Philipp Stephani; +Cc: emacs-devel [-- Attachment #1: Type: text/plain, Size: 371 bytes --] Thanks for doing all that. I installed your patches into the Emacs master, along with the attached further patch which omits the undocumented support for escapes like "\N{CJK IDEOGRAPH-3400}" as I couldn't see the utility of these over and above plain "\N{U+3400}", plus it wasn't clear why CJK ideographs needed special-case names whereas other ideographs did not. [-- Attachment #2: 0001-Improve-character-name-escapes.txt --] [-- Type: text/plain, Size: 15431 bytes --] From bd1c7ca67e7429e07f78d4ff49163fd7a67a6765 Mon Sep 17 00:00:00 2001 From: Paul Eggert <eggert@cs.ucla.edu> Date: Thu, 21 Apr 2016 19:26:34 -0700 Subject: [PATCH] Improve character name escapes * doc/lispref/nonascii.texi (Character Properties): Avoid duplication of Unicode names. Reformat examples to fit in narrow pages. * doc/lispref/objects.texi (General Escape Syntax): Simplify and better-organize explanation of \N{...} escapes. * src/character.h (CHAR_SURROGATE_PAIR_P): Remove; unused. (char_surrogate_p): New inline function. * src/lread.c: Do not include string.h; no longer needed. (invalid_character_name, check_scalar_value): Remove; the ideas behind these functions are now bundled into character_name_to_code. (character_name_to_code): Remove undocumented support for "CJK IDEOGRAPH-XXXX" names, as "U+XXXX" suffices. Reject monstrosities like "\N{U+-0}" and null bytes in \N escapes. Reject floating point in \N escapes instead of returning garbage. Use AUTO_STRING_WITH_LEN to lessen pressure on the garbage collector. 
* test/src/lread-tests.el (lread-char-number, lread-char-name) (lread-string-char-number, lread-string-char-name): Test runtime behavior, not compile-time, as the test framework is not set up to test compile-time. (lread-char-surrogate-1, lread-char-surrogate-2) (lread-char-surrogate-3, lread-char-surrogate-4) (lread-string-char-number-2, lread-string-char-number-3): New tests. (lread-string-char-number-1): Rename from lread-string-char-number. --- doc/lispref/nonascii.texi | 15 ++++--- doc/lispref/objects.texi | 52 +++++++++++------------ src/character.h | 13 +++--- src/lread.c | 104 +++++++++++++--------------------------------- test/src/lread-tests.el | 32 ++++++++------ 5 files changed, 89 insertions(+), 127 deletions(-) diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi index 66ad9ac..0e4aa86 100644 --- a/doc/lispref/nonascii.texi +++ b/doc/lispref/nonascii.texi @@ -622,18 +622,21 @@ Character Properties @result{} Nd @end group @group -;; U+2084 SUBSCRIPT FOUR -(get-char-code-property ?\u2084 'digit-value) +;; U+2084 +(get-char-code-property ?\N@{SUBSCRIPT FOUR@} + 'digit-value) @result{} 4 @end group @group -;; U+2155 VULGAR FRACTION ONE FIFTH -(get-char-code-property ?\u2155 'numeric-value) +;; U+2155 +(get-char-code-property ?\N@{VULGAR FRACTION ONE FIFTH@} + 'numeric-value) @result{} 0.2 @end group @group -;; U+2163 ROMAN NUMERAL FOUR -(get-char-code-property ?\N@{ROMAN NUMERAL FOUR@} 'numeric-value) +;; U+2163 +(get-char-code-property ?\N@{ROMAN NUMERAL FOUR@} + 'numeric-value) @result{} 4 @end group @group diff --git a/doc/lispref/objects.texi b/doc/lispref/objects.texi index 96b334d..54894b8 100644 --- a/doc/lispref/objects.texi +++ b/doc/lispref/objects.texi @@ -353,25 +353,32 @@ General Escape Syntax control characters, Emacs provides several types of escape syntax that you can use to specify non-@acronym{ASCII} text characters. 
+@enumerate +@item @cindex @samp{\} in character constant @cindex backslash in character constants @cindex unicode character escape - Firstly, you can specify characters by their Unicode values. -@code{?\u@var{nnnn}} represents a character with Unicode code point -@samp{U+@var{nnnn}}, where @var{nnnn} is (by convention) a hexadecimal -number with exactly four digits. The backslash indicates that the -subsequent characters form an escape sequence, and the @samp{u} -specifies a Unicode escape sequence. - - There is a slightly different syntax for specifying Unicode -characters with code points higher than @code{U+@var{ffff}}: -@code{?\U00@var{nnnnnn}} represents the character with code point -@samp{U+@var{nnnnnn}}, where @var{nnnnnn} is a six-digit hexadecimal -number. The Unicode Standard only defines code points up to -@samp{U+@var{10ffff}}, so if you specify a code point higher than -that, Emacs signals an error. - - Secondly, you can specify characters by their hexadecimal character +You can specify characters by their Unicode names, if any. +@code{?\N@{@var{NAME}@}} represents the Unicode character named +@var{NAME}. Thus, @samp{?\N@{LATIN SMALL LETTER A WITH GRAVE@}} is +equivalent to @code{?à} and denotes the Unicode character U+00E0. To +simplify entering multi-line strings, you can replace spaces in the +names by non-empty sequences of whitespace (e.g., newlines). + +@item +You can specify characters by their Unicode values. +@code{?\N@{U+@var{X}@}} represents a character with Unicode code point +@var{X}, where @var{X} is a hexadecimal number. Also, +@code{?\u@var{xxxx}} and @code{?\U@var{xxxxxxxx}} represent code +points @var{xxxx} and @var{xxxxxxxx}, respectively, where each @var{x} +is a single hexadecimal digit. For example, @code{?\N@{U+E0@}}, +@code{?\u00e0} and @code{?\U000000E0} are all equivalent to @code{?à} +and to @samp{?\N@{LATIN SMALL LETTER A WITH GRAVE@}}. 
The Unicode +Standard defines code points only up to @samp{U+@var{10ffff}}, so if +you specify a code point higher than that, Emacs signals an error. + +@item +You can specify characters by their hexadecimal character codes. A hexadecimal escape sequence consists of a backslash, @samp{x}, and the hexadecimal character code. Thus, @samp{?\x41} is the character @kbd{A}, @samp{?\x1} is the character @kbd{C-a}, and @@ -379,23 +386,16 @@ General Escape Syntax You can use any number of hex digits, so you can represent any character code in this way. +@item @cindex octal character code - Thirdly, you can specify characters by their character code in +You can specify characters by their character code in octal. An octal escape sequence consists of a backslash followed by up to three octal digits; thus, @samp{?\101} for the character @kbd{A}, @samp{?\001} for the character @kbd{C-a}, and @code{?\002} for the character @kbd{C-b}. Only characters up to octal code 777 can be specified this way. - Fourthly, you can specify characters by their name. A character -name escape sequence consists of a backslash, @samp{N@{}, the Unicode -character name, and @samp{@}}. Alternatively, you can also put the -numeric code point value between the braces, using the syntax -@samp{\N@{U+nnnn@}}, where @samp{nnnn} denotes between one and eight -hexadecimal digits. Thus, @samp{?\N@{LATIN CAPITAL LETTER A@}} and -@samp{?\N@{U+41@}} both denote the character @kbd{A}. To simplify -entering multi-line strings, you can replace spaces in the character -names by arbitrary non-empty sequence of whitespace (e.g., newlines). +@end enumerate These escape sequences may also be used in strings. @xref{Non-ASCII in Strings}. diff --git a/src/character.h b/src/character.h index bc3e155..586f330 100644 --- a/src/character.h +++ b/src/character.h @@ -612,14 +612,13 @@ sanitize_char_width (EMACS_INT width) : (c) <= 0xE01EF ? (c) - 0xE0100 + 17 \ : 0) -/* If C is a high surrogate, return 1. 
If C is a low surrogate, - return 2. Otherwise, return 0. */ +/* Return true if C is a surrogate. */ -#define CHAR_SURROGATE_PAIR_P(c) \ - ((c) < 0xD800 ? 0 \ - : (c) <= 0xDBFF ? 1 \ - : (c) <= 0xDFFF ? 2 \ - : 0) +INLINE bool +char_surrogate_p (int c) +{ + return 0xD800 <= c && c <= 0xDFFF; +} /* Data type for Unicode general category. diff --git a/src/lread.c b/src/lread.c index c3b6bd7..a42c1f6 100644 --- a/src/lread.c +++ b/src/lread.c @@ -44,7 +44,6 @@ along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>. */ #include "termhooks.h" #include "blockinput.h" #include <c-ctype.h> -#include <string.h> #ifdef MSDOS #include "msdos.h" @@ -2151,88 +2150,42 @@ grow_read_buffer (void) MAX_MULTIBYTE_LENGTH, -1, 1); } -/* Signal an invalid-read-syntax error indicating that the character - name in an \N{…} literal is invalid. */ -static _Noreturn void -invalid_character_name (Lisp_Object name) -{ - AUTO_STRING (format, "\\N{%s}"); - xsignal1 (Qinvalid_read_syntax, CALLN (Fformat, format, name)); -} - -/* Check that CODE is a valid Unicode scalar value, and return its - value. CODE should be parsed from the character name given by - NAME. NAME is used for error messages. */ +/* Return the scalar value that has the Unicode character name NAME. + Raise 'invalid-read-syntax' if there is no such character. */ static int -check_scalar_value (Lisp_Object code, Lisp_Object name) +character_name_to_code (char const *name, ptrdiff_t name_len) { - if (! NUMBERP (code)) - invalid_character_name (name); - EMACS_INT i = XINT (code); - if (! (0 <= i && i <= MAX_UNICODE_CHAR) - /* Don't allow surrogates. */ - || (0xD800 <= code && code <= 0xDFFF)) - invalid_character_name (name); - return i; -} + Lisp_Object code; -/* If NAME starts with PREFIX, interpret the rest as a hexadecimal - number and return its value. Raise invalid-read-syntax if the - number is not a valid scalar value. Return −1 if NAME doesn’t - start with PREFIX. 
*/ -static int -parse_code_after_prefix (Lisp_Object name, const char *prefix) -{ - ptrdiff_t name_len = SBYTES (name); - ptrdiff_t prefix_len = strlen (prefix); - /* Allow between one and eight hexadecimal digits after the - prefix. */ - if (prefix_len < name_len && name_len <= prefix_len + 8 - && memcmp (SDATA (name), prefix, prefix_len) == 0) + /* Code point as U+XXXX.... */ + if (name[0] == 'U' && name[1] == '+') { - Lisp_Object code = string_to_number (SDATA (name) + prefix_len, 16, false); - if (NUMBERP (code)) - return check_scalar_value (code, name); + /* Pass the leading '+' to string_to_number, so that it + rejects monstrosities such as negative values. */ + code = string_to_number (name + 1, 16, false); + } + else + { + /* Look up the name in the table returned by 'ucs-names'. */ + AUTO_STRING_WITH_LEN (namestr, name, name_len); + Lisp_Object names = call0 (Qucs_names); + code = CDR (Fassoc (namestr, names)); } - return -1; -} -/* Returns the scalar value that has the Unicode character name NAME. - Raises `invalid-read-syntax' if there is no such character. */ -static int -character_name_to_code (Lisp_Object name) -{ - /* Code point as U+N, where N is between 1 and 8 hexadecimal - digits. */ - int code = parse_code_after_prefix (name, "U+"); - if (code >= 0) - return code; - - /* CJK ideographs are not contained in the association list returned - by `ucs-names'. But they follow a predictable naming pattern: a - fixed prefix plus the hexadecimal codepoint value. */ - code = parse_code_after_prefix (name, "CJK IDEOGRAPH-"); - if (code >= 0) + if (! (INTEGERP (code) + && 0 <= XINT (code) && XINT (code) <= MAX_UNICODE_CHAR + && ! char_surrogate_p (XINT (code)))) { - /* Various ranges of CJK characters; see UnicodeData.txt. 
*/ - if ((0x3400 <= code && code <= 0x4DB5) - || (0x4E00 <= code && code <= 0x9FD5) - || (0x20000 <= code && code <= 0x2A6D6) - || (0x2A700 <= code && code <= 0x2B734) - || (0x2B740 <= code && code <= 0x2B81D) - || (0x2B820 <= code && code <= 0x2CEA1)) - return code; - else - invalid_character_name (name); + AUTO_STRING (format, "\\N{%s}"); + AUTO_STRING_WITH_LEN (namestr, name, name_len); + xsignal1 (Qinvalid_read_syntax, CALLN (Fformat, format, namestr)); } - /* Look up the name in the table returned by `ucs-names'. */ - Lisp_Object names = call0 (Qucs_names); - return check_scalar_value (CDR (Fassoc (name, names)), name); + return XINT (code); } /* Bound on the length of a Unicode character name. As of - Unicode 9.0.0 the maximum is 83, so this should be safe. */ + Unicode 9.0.0 the maximum is 83, so this should be safe. */ enum { UNICODE_CHARACTER_NAME_LENGTH_BOUND = 200 }; /* Read a \-escape sequence, assuming we already read the `\'. @@ -2458,14 +2411,14 @@ read_escape (Lisp_Object readcharfun, bool stringp) end_of_file_error (); if (c == '}') break; - if (! c_isascii (c)) + if (! (0 < c && c < 0x80)) { AUTO_STRING (format, - "Non-ASCII character U+%04X in character name"); + "Invalid character U+%04X in character name"); xsignal1 (Qinvalid_read_syntax, CALLN (Fformat, format, make_natnum (c))); } - /* We treat multiple adjacent whitespace characters as a + /* Treat multiple adjacent whitespace characters as a single space character. This makes it easier to use character names in e.g. multi-line strings. 
*/ if (c_isspace (c)) @@ -2483,7 +2436,8 @@ read_escape (Lisp_Object readcharfun, bool stringp) } if (length == 0) invalid_syntax ("Empty character name"); - return character_name_to_code (make_unibyte_string (name, length)); + name[length] = '\0'; + return character_name_to_code (name, length); } default: diff --git a/test/src/lread-tests.el b/test/src/lread-tests.el index ff5d0f6..2ebaf49 100644 --- a/test/src/lread-tests.el +++ b/test/src/lread-tests.el @@ -1,6 +1,6 @@ ;;; lread-tests.el --- tests for lread.c -*- lexical-binding: t; -*- -;; Copyright (C) 2016 Google Inc. +;; Copyright (C) 2016 Free Software Foundation, Inc. ;; Author: Philipp Stephani <phst@google.com> @@ -26,11 +26,10 @@ ;;; Code: (ert-deftest lread-char-number () - (should (equal ?\N{U+A817} #xA817))) + (should (equal (read "?\\N{U+A817}") #xA817))) (ert-deftest lread-char-name () - (should (equal ?\N{SYLOTI NAGRI LETTER - DHO} + (should (equal (read "?\\N{SYLOTI NAGRI LETTER \n DHO}") #xA817))) (ert-deftest lread-char-invalid-number () @@ -46,16 +45,23 @@ (ert-deftest lread-char-empty-name () (should-error (read "?\\N{}") :type 'invalid-read-syntax)) -(ert-deftest lread-char-cjk-name () - (should (equal ?\N{CJK IDEOGRAPH-2B734} #x2B734))) - -(ert-deftest lread-char-invalid-cjk-name () - (should-error (read "?\\N{CJK IDEOGRAPH-2B735}") :type 'invalid-read-syntax)) - -(ert-deftest lread-string-char-number () - (should (equal "a\N{U+A817}b" "a\uA817b"))) +(ert-deftest lread-char-surrogate-1 () + (should-error (read "?\\N{U+D800}") :type 'invalid-read-syntax)) +(ert-deftest lread-char-surrogate-2 () + (should-error (read "?\\N{U+D801}") :type 'invalid-read-syntax)) +(ert-deftest lread-char-surrogate-3 () + (should-error (read "?\\N{U+Dffe}") :type 'invalid-read-syntax)) +(ert-deftest lread-char-surrogate-4 () + (should-error (read "?\\N{U+DFFF}") :type 'invalid-read-syntax)) + +(ert-deftest lread-string-char-number-1 () + (should (equal (read "a\\N{U+A817}b") "a\uA817bx"))) +(ert-deftest 
lread-string-char-number-2 () + (should-error (read "?\\N{0.5}") :type 'invalid-read-syntax)) +(ert-deftest lread-string-char-number-3 () + (should-error (read "?\\N{U+-0}") :type 'invalid-read-syntax)) (ert-deftest lread-string-char-name () - (should (equal "a\N{SYLOTI NAGRI LETTER DHO}b" "a\uA817b"))) + (should (equal (read "a\\N{SYLOTI NAGRI LETTER DHO}b") "a\uA817b"))) ;;; lread-tests.el ends here -- 2.5.5 ^ permalink raw reply related [flat|nested] 47+ messages in thread
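For readers following the patch above, the range check that character_name_to_code applies to a parsed code point can be sketched in isolation. This is only an illustration under stated assumptions, not the code installed in Emacs: the names sketch_char_surrogate_p, sketch_scalar_value_p and SKETCH_MAX_UNICODE_CHAR are hypothetical stand-ins for char_surrogate_p and MAX_UNICODE_CHAR, and plain C integers are used instead of Lisp_Object values.

```c
#include <assert.h>
#include <stdbool.h>

/* Largest Unicode code point, U+10FFFF.  Hypothetical name for this
   sketch; the patch itself uses Emacs's MAX_UNICODE_CHAR.  */
enum { SKETCH_MAX_UNICODE_CHAR = 0x10FFFF };

/* Return true if C lies in the surrogate range U+D800..U+DFFF.
   Modeled on the char_surrogate_p inline function added by the patch.  */
static bool
sketch_char_surrogate_p (long c)
{
  return 0xD800 <= c && c <= 0xDFFF;
}

/* Return true if C is a Unicode scalar value, i.e. a code point in
   0..U+10FFFF that is not a surrogate.  */
bool
sketch_scalar_value_p (long c)
{
  return (0 <= c && c <= SKETCH_MAX_UNICODE_CHAR
          && !sketch_char_surrogate_p (c));
}

/* Spot checks that mirror the cases exercised in lread-tests.el.  */
void
sketch_scalar_value_examples (void)
{
  assert (sketch_scalar_value_p (0xA817));    /* \N{U+A817} is accepted */
  assert (!sketch_scalar_value_p (0xD800));   /* first surrogate */
  assert (!sketch_scalar_value_p (0xDFFF));   /* last surrogate */
  assert (!sketch_scalar_value_p (0x110000)); /* beyond U+10FFFF */
}
```

In the real reader a failed check signals invalid-read-syntax; the sketch returns a boolean only so that it can be compiled and tried outside Emacs.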
* Re: Character literals for Unicode (control) characters
From: Eli Zaretskii @ 2016-04-22 7:57 UTC
To: Paul Eggert; +Cc: p.stephani2, emacs-devel

> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Thu, 21 Apr 2016 19:39:50 -0700
> Cc: emacs-devel@gnu.org
>
> Thanks for doing all that. I installed your patches into the Emacs
> master, along with the attached further patch which omits the
> undocumented support for escapes like "\N{CJK IDEOGRAPH-3400}" as I
> couldn't see the utility of these over and above plain "\N{U+3400}",
> plus it wasn't clear why CJK ideographs needed special-case names
> whereas other ideographs did not.

I think \N{} should accept any name that appears in UnicodeData.txt,
so "CJK IDEOGRAPH-3400" and its ilk should be an exception, even if we
don't see any great utility in that. I think omitting any names is a
mistake, unless other environments (e.g., Python) do the same.

As for other ideographs that are not treated the same, I'm not sure
which ones you had in mind; can you elaborate? Note that, as I said in

  http://lists.gnu.org/archive/html/emacs-devel/2016-03/msg00919.html

a new range of similarly treated ideographs was added by Unicode 9.0,
so they should be treated the same (they don't appear in ucs-names).

Thanks.
* Re: Character literals for Unicode (control) characters
From: Eli Zaretskii @ 2016-04-22 8:01 UTC
To: eggert; +Cc: p.stephani2, emacs-devel

> Date: Fri, 22 Apr 2016 10:57:26 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: p.stephani2@gmail.com, emacs-devel@gnu.org
>
> I think \N{} should accept any name that appears in UnicodeData.txt,
> so "CJK IDEOGRAPH-3400" and its ilk should be an exception, even if we
                                      ^^^^^^^^^^^^^^^^^^^^^^
I meant "shouldn't", of course. Sorry for the typo.
* Re: Character literals for Unicode (control) characters
From: Elias Mårtenson @ 2016-04-22 9:39 UTC
To: Eli Zaretskii; +Cc: p.stephani2, Paul Eggert, emacs-devel

On 22 April 2016 at 16:01, Eli Zaretskii <eliz@gnu.org> wrote:

> > Date: Fri, 22 Apr 2016 10:57:26 +0300
> > From: Eli Zaretskii <eliz@gnu.org>
> > Cc: p.stephani2@gmail.com, emacs-devel@gnu.org
> >
> > I think \N{} should accept any name that appears in UnicodeData.txt,
> > so "CJK IDEOGRAPH-3400" and its ilk should be an exception, even if we
>                                       ^^^^^^^^^^^^^^^^^^^^^^
> I meant "shouldn't", of course. Sorry for the typo.
>

CJK IDEOGRAPH-nnnn is not part of UnicodeData.txt though.
* Re: Character literals for Unicode (control) characters
From: Eli Zaretskii @ 2016-04-22 10:01 UTC
To: Elias Mårtenson; +Cc: p.stephani2, eggert, emacs-devel

> Date: Fri, 22 Apr 2016 17:39:38 +0800
> From: Elias Mårtenson <lokedhs@gmail.com>
> Cc: Paul Eggert <eggert@cs.ucla.edu>, p.stephani2@gmail.com,
>     emacs-devel <emacs-devel@gnu.org>
>
> CJK IDEOGRAPH-nnnn is not part of UnicodeData.txt though.

I meant "CJK COMPATIBILITY IDEOGRAPH-nnnn", sorry for the confusion.
* Re: Character literals for Unicode (control) characters 2016-04-22 10:01 ` Eli Zaretskii @ 2016-04-25 17:48 ` Paul Eggert 0 siblings, 0 replies; 47+ messages in thread From: Paul Eggert @ 2016-04-25 17:48 UTC (permalink / raw) To: Eli Zaretskii, Elias Mårtenson; +Cc: p.stephani2, emacs-devel [-- Attachment #1: Type: text/plain, Size: 608 bytes --] On 04/22/2016 03:01 AM, Eli Zaretskii wrote: > I meant "CJK COMPATIBILITY IDEOGRAPH-nnnn", sorry for the confusion. OK, thanks for clarifying. In looking into this, I noticed a curious ambiguity: 'C-x 8 RET B E D RET' inserts U+0BED (TAMIL DIGIT SEVEN), whereas it should insert U+1F6CF (BED). BED is currently the only Unicode name that consists entirely of hexadecimal digits. I installed the attached patch into master to fix the problem you mentioned, along with fixing this ambiguity and a couple of other related problems that I noticed. Perhaps BED should be renamed, but that's not our job. [-- Attachment #2: 0001-New-function-char-from-name.txt --] [-- Type: text/plain, Size: 11098 bytes --] From aef09688a357c815fd3ccfbc04592717737e6c86 Mon Sep 17 00:00:00 2001 From: Paul Eggert <eggert@cs.ucla.edu> Date: Mon, 25 Apr 2016 10:41:29 -0700 Subject: [PATCH] =?UTF-8?q?New=20function=20=E2=80=98char-from-name?= =?UTF-8?q?=E2=80=99?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This also fixes the mishandling of "\N{CJK COMPATIBILITY IDEOGRAPH-F900}", "\N{VARIATION SELECTOR-1}", etc. Problem reported by Eli Zaretskii in: http://lists.gnu.org/archive/html/emacs-devel/2016-04/msg00614.html * doc/lispref/nonascii.texi (Character Codes), etc/NEWS: Document this. * lisp/international/mule-cmds.el (char-from-name): New function. (read-char-by-name): Use it. Document that "BED" is treated as a name, not as a hexadecimal number. Reject out-of-range integers, floating-point numbers, and strings with trailing junk. 
* src/lread.c (character_name_to_code): Call char-from-name instead of inspecting ucs-names directly, so that we handle computed names like "VARIATION SELECTOR-1". Do not use an auto string, since char-from-name might GC. * test/src/lread-tests.el: Add tests for new behavior, and fix some old tests that were wrong. --- doc/lispref/nonascii.texi | 12 +++++++++++ etc/NEWS | 4 ++++ lisp/international/mule-cmds.el | 43 +++++++++++++++++++++++++++--------- src/lread.c | 31 +++++++++----------------- test/src/lread-tests.el | 48 +++++++++++++++++++++++++++++++++++++---- 5 files changed, 103 insertions(+), 35 deletions(-) diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi index 0e4aa86..fd2ce32 100644 --- a/doc/lispref/nonascii.texi +++ b/doc/lispref/nonascii.texi @@ -420,6 +420,18 @@ Character Codes @end example @end defun +@defun char-from-name string &optional ignore-case +This function returns the character whose Unicode name is @var{string}. +If @var{ignore-case} is non-@code{nil}, case is ignored in @var{string}. +This function returns @code{nil} if @var{string} does not name a character. + +@example +;; U+03A3 +(= (char-from-name "GREEK CAPITAL LETTER SIGMA") #x03A3) + @result{} t +@end example +@end defun + @defun get-byte &optional pos string This function returns the byte at character position @var{pos} in the current buffer. If the current buffer is unibyte, this is literally diff --git a/etc/NEWS b/etc/NEWS index 6bdb648..e401d2d 100644 --- a/etc/NEWS +++ b/etc/NEWS @@ -391,6 +391,10 @@ compares their numerical values. According to this predicate, "foo2.png" is smaller than "foo12.png". +++ +** The new function 'char-from-name' converts a Unicode name string +to the corresponding character code. + ++++ ** New functions 'sxhash-eq' and 'sxhash-eql' return hash codes of a Lisp object suitable for use with 'eq' and 'eql' correspondingly. 
If two objects are 'eq' ('eql'), then the result of 'sxhash-eq' diff --git a/lisp/international/mule-cmds.el b/lisp/international/mule-cmds.el index 8eb320a..2ce21a8 100644 --- a/lisp/international/mule-cmds.el +++ b/lisp/international/mule-cmds.el @@ -2978,6 +2978,27 @@ mule--ucs-names-annotation (let ((char (assoc name ucs-names))) (when char (format " (%c)" (cdr char))))) +(defun char-from-name (string &optional ignore-case) + "Return a character as a number from its Unicode name STRING. +If optional IGNORE-CASE is non-nil, ignore case in STRING. +Return nil if STRING does not name a character." + (or (cdr (assoc-string string (ucs-names) ignore-case)) + (let ((minus (string-match-p "-[0-9A-F]+\\'" string))) + (when minus + ;; Parse names like "VARIATION SELECTOR-17" and "CJK + ;; COMPATIBILITY IDEOGRAPH-F900" that are not in ucs-names. + (ignore-errors + (let* ((case-fold-search ignore-case) + (vs (string-match-p "\\`VARIATION SELECTOR-" string)) + (minus-num (string-to-number (substring string minus) + (if vs 10 16))) + (vs-offset (if vs (if (< minus-num -16) #xE00EF #xFDFF) 0)) + (code (- vs-offset minus-num)) + (name (get-char-code-property code 'name))) + (when (eq t (compare-strings string nil nil name nil nil + ignore-case)) + code))))))) + (defun read-char-by-name (prompt) "Read a character by its Unicode name or hex number string. Display PROMPT and read a string that represents a character by its @@ -2991,9 +3012,11 @@ read-char-by-name the characters whose names include that substring, not necessarily at the beginning of the name. -This function also accepts a hexadecimal number of Unicode code -point or a number in hash notation, e.g. #o21430 for octal, -#x2318 for hex, or #10r8984 for decimal." +Accept a name like \"CIRCULATION FUNCTION\", a hexadecimal +number like \"2A10\", or a number in hash notation (e.g., +\"#x2a10\" for hex, \"10r10768\" for decimal, or \"#o25020\" for +octal). 
Treat otherwise-ambiguous strings like \"BED\" (U+1F6CF) +as names, not numbers." (let* ((enable-recursive-minibuffers t) (completion-ignore-case t) (input @@ -3006,13 +3029,13 @@ read-char-by-name (category . unicode-name)) (complete-with-action action (ucs-names) string pred))))) (char - (cond - ((string-match-p "\\`[0-9a-fA-F]+\\'" input) - (string-to-number input 16)) - ((string-match-p "\\`#" input) - (read input)) - (t - (cdr (assoc-string input (ucs-names) t)))))) + (cond + ((char-from-name input t)) + ((string-match-p "\\`[0-9a-fA-F]+\\'" input) + (ignore-errors (string-to-number input 16))) + ((string-match-p "\\`#\\([bBoOxX]\\|[0-9]+[rR]\\)[0-9a-zA-Z]+\\'" + input) + (ignore-errors (read input)))))) (unless (characterp char) (error "Invalid character")) char)) diff --git a/src/lread.c b/src/lread.c index a42c1f6..6e97e07 100644 --- a/src/lread.c +++ b/src/lread.c @@ -2155,26 +2155,15 @@ grow_read_buffer (void) static int character_name_to_code (char const *name, ptrdiff_t name_len) { - Lisp_Object code; - - /* Code point as U+XXXX.... */ - if (name[0] == 'U' && name[1] == '+') - { - /* Pass the leading '+' to string_to_number, so that it - rejects monstrosities such as negative values. */ - code = string_to_number (name + 1, 16, false); - } - else - { - /* Look up the name in the table returned by 'ucs-names'. */ - AUTO_STRING_WITH_LEN (namestr, name, name_len); - Lisp_Object names = call0 (Qucs_names); - code = CDR (Fassoc (namestr, names)); - } - - if (! (INTEGERP (code) - && 0 <= XINT (code) && XINT (code) <= MAX_UNICODE_CHAR - && ! char_surrogate_p (XINT (code)))) + /* For "U+XXXX", pass the leading '+' to string_to_number to reject + monstrosities like "U+-0000". */ + Lisp_Object code + = (name[0] == 'U' && name[1] == '+' + ? string_to_number (name + 1, 16, false) + : call2 (Qchar_from_name, make_unibyte_string (name, name_len), Qt)); + + if (! 
RANGED_INTEGERP (0, code, MAX_UNICODE_CHAR) + || char_surrogate_p (XINT (code))) { AUTO_STRING (format, "\\N{%s}"); AUTO_STRING_WITH_LEN (namestr, name, name_len); @@ -4829,5 +4818,5 @@ that are loaded before your customizations are read! */); DEFSYM (Qrehash_size, "rehash-size"); DEFSYM (Qrehash_threshold, "rehash-threshold"); - DEFSYM (Qucs_names, "ucs-names"); + DEFSYM (Qchar_from_name, "char-from-name"); } diff --git a/test/src/lread-tests.el b/test/src/lread-tests.el index 2ebaf49..1a82d13 100644 --- a/test/src/lread-tests.el +++ b/test/src/lread-tests.el @@ -28,15 +28,55 @@ (ert-deftest lread-char-number () (should (equal (read "?\\N{U+A817}") #xA817))) -(ert-deftest lread-char-name () +(ert-deftest lread-char-name-1 () (should (equal (read "?\\N{SYLOTI NAGRI LETTER \n DHO}") #xA817))) +(ert-deftest lread-char-name-2 () + (should (equal (read "?\\N{BED}") #x1F6CF))) +(ert-deftest lread-char-name-3 () + (should (equal (read "?\\N{U+BED}") #xBED))) +(ert-deftest lread-char-name-4 () + (should (equal (read "?\\N{VARIATION SELECTOR-1}") #xFE00))) +(ert-deftest lread-char-name-5 () + (should (equal (read "?\\N{VARIATION SELECTOR-16}") #xFE0F))) +(ert-deftest lread-char-name-6 () + (should (equal (read "?\\N{VARIATION SELECTOR-17}") #xE0100))) +(ert-deftest lread-char-name-7 () + (should (equal (read "?\\N{VARIATION SELECTOR-256}") #xE01EF))) +(ert-deftest lread-char-name-8 () + (should (equal (read "?\\N{CJK COMPATIBILITY IDEOGRAPH-F900}") #xF900))) +(ert-deftest lread-char-name-9 () + (should (equal (read "?\\N{CJK COMPATIBILITY IDEOGRAPH-FAD9}") #xFAD9))) +(ert-deftest lread-char-name-10 () + (should (equal (read "?\\N{CJK COMPATIBILITY IDEOGRAPH-2F800}") #x2F800))) +(ert-deftest lread-char-name-11 () + (should (equal (read "?\\N{CJK COMPATIBILITY IDEOGRAPH-2FA1D}") #x2FA1D))) (ert-deftest lread-char-invalid-number () (should-error (read "?\\N{U+110000}") :type 'invalid-read-syntax)) -(ert-deftest lread-char-invalid-name () +(ert-deftest 
lread-char-invalid-name-1 () (should-error (read "?\\N{DOES NOT EXIST}")) :type 'invalid-read-syntax) +(ert-deftest lread-char-invalid-name-2 () + (should-error (read "?\\N{VARIATION SELECTOR-0}")) :type 'invalid-read-syntax) +(ert-deftest lread-char-invalid-name-3 () + (should-error (read "?\\N{VARIATION SELECTOR-257}")) + :type 'invalid-read-syntax) +(ert-deftest lread-char-invalid-name-4 () + (should-error (read "?\\N{VARIATION SELECTOR--0}")) + :type 'invalid-read-syntax) +(ert-deftest lread-char-invalid-name-5 () + (should-error (read "?\\N{CJK COMPATIBILITY IDEOGRAPH-F8FF}")) + :type 'invalid-read-syntax) +(ert-deftest lread-char-invalid-name-6 () + (should-error (read "?\\N{CJK COMPATIBILITY IDEOGRAPH-FADA}")) + :type 'invalid-read-syntax) +(ert-deftest lread-char-invalid-name-7 () + (should-error (read "?\\N{CJK COMPATIBILITY IDEOGRAPH-2F7FF}")) + :type 'invalid-read-syntax) +(ert-deftest lread-char-invalid-name-8 () + (should-error (read "?\\N{CJK COMPATIBILITY IDEOGRAPH-2FA1E}")) + :type 'invalid-read-syntax) (ert-deftest lread-char-non-ascii-name () (should-error (read "?\\N{LATIN CAPITAL LETTER Ø}") @@ -55,13 +95,13 @@ (should-error (read "?\\N{U+DFFF}") :type 'invalid-read-syntax)) (ert-deftest lread-string-char-number-1 () - (should (equal (read "a\\N{U+A817}b") "a\uA817bx"))) + (should (equal (read "\"a\\N{U+A817}b\"") "a\uA817b"))) (ert-deftest lread-string-char-number-2 () (should-error (read "?\\N{0.5}") :type 'invalid-read-syntax)) (ert-deftest lread-string-char-number-3 () (should-error (read "?\\N{U+-0}") :type 'invalid-read-syntax)) (ert-deftest lread-string-char-name () - (should (equal (read "a\\N{SYLOTI NAGRI LETTER DHO}b") "a\uA817b"))) + (should (equal (read "\"a\\N{SYLOTI NAGRI LETTER DHO}b\"") "a\uA817b"))) ;;; lread-tests.el ends here -- 2.5.5 ^ permalink raw reply related [flat|nested] 47+ messages in thread
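The computed-name handling in char-from-name above relies on a simple arithmetic relation between the number in "VARIATION SELECTOR-N" and the code point. The following C sketch restates that relation on its own; it is an assumption-laden illustration rather than anything from the patch, and sketch_variation_selector_code is a hypothetical name. It takes the already-parsed decimal number N and returns -1 where the Lisp code would return nil.

```c
#include <assert.h>

/* Map N, the decimal number in "VARIATION SELECTOR-N", to its code
   point.  Selectors 1..16 occupy U+FE00..U+FE0F and selectors 17..256
   occupy U+E0100..U+E01EF.  Return -1 if no such selector exists.  */
int
sketch_variation_selector_code (int n)
{
  if (1 <= n && n <= 16)
    return 0xFE00 + (n - 1);
  if (17 <= n && n <= 256)
    return 0xE0100 + (n - 17);
  return -1;
}

/* Spot checks taken from the values used in lread-tests.el.  */
void
sketch_variation_selector_examples (void)
{
  assert (sketch_variation_selector_code (1) == 0xFE00);
  assert (sketch_variation_selector_code (16) == 0xFE0F);
  assert (sketch_variation_selector_code (17) == 0xE0100);
  assert (sketch_variation_selector_code (256) == 0xE01EF);
  assert (sketch_variation_selector_code (0) == -1);
  assert (sketch_variation_selector_code (257) == -1);
}
```

The Lisp version reaches the same values by subtracting the (negative) parsed number from an offset of #xFDFF or #xE00EF, and then confirms the result by comparing the candidate's 'name' property against the input string, a step this sketch leaves out.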
* Re: Character literals for Unicode (control) characters
From: Clément Pit--Claudel @ 2016-03-05 16:35 UTC
To: emacs-devel

On 03/03/2016 11:11 AM, Paul Eggert wrote:
> More issues: should we insist on the full official name? should we
> allow obsolescent aliases? lower-case instead of upper case? initial
> prefixes of names?

Another issue is text wrapping: M-q on a docstring containing these escape sequences will break lines in a way that will look ugly when viewing the rendered docstring:

    "Use Greek capital letters (\u[GREEK CAPITAL LETTER ALPHA]
    \u[EN DASH]\u[GREEK CAPITAL LETTER OMEGA]) to denote figures."

will be rendered as

    "Use Greek capital letters (Α–
    Ω) to denote figures."

which isn't right.
* Re: Character literals for Unicode (control) characters
From: Paul Eggert @ 2016-03-05 17:12 UTC
To: Clément Pit--Claudel, emacs-devel

Clément Pit--Claudel wrote:
> Another issue is text wrapping: M-q on a docstring containing these escape sequences will break lines in a way that will look ugly when viewing the rendered docstring:
>
> "Use Greek capital letters (\u[GREEK CAPITAL LETTER ALPHA]
> \u[EN DASH]\u[GREEK CAPITAL LETTER OMEGA]) to denote figures."
>
> will be rendered as
>
> "Use Greek capital letters (Α–
> Ω) to denote figures."
>
> which isn't right.

I don't see a problem here. The original string should look something like this:

"Use Greek capital letters (\N{GREEK CAPITAL LETTER ALPHA}\N{EN DASH}\N{GREEK CAPITAL LETTER OMEGA}) to denote figures."

and there's no space between the "DASH}" and the following "\N{GREEK" for M-q to latch onto. I just now tried M-q on the above string and it came up with:

(defun foo (abc)
  "Use Greek capital letters (\N{GREEK CAPITAL LETTER ALPHA}\N{EN
DASH}\N{GREEK CAPITAL LETTER OMEGA}) to denote figures."
  ...)

which should work OK if arbitrary white space is allowed between words inside \N{...} escapes.
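The condition Paul states (white space allowed between the words of a name) can be tried directly. This is only a sketch: it assumes an Emacs whose reader supports \N{...} and folds a newline inside the braces into a single space, which is my understanding of the released implementation but is not something this message confirms. The function name `sketch-named-escape-demo` is hypothetical.

```elisp
(defun sketch-named-escape-demo ()
  "Return (PLAIN . SAME), comparing a one-line and a wrapped \\N{...} string."
  (let ((plain
         (read "\"\\N{GREEK CAPITAL LETTER ALPHA}\\N{EN DASH}\\N{GREEK CAPITAL LETTER OMEGA}\""))
        ;; Same escapes, but with a newline inside the EN DASH name,
        ;; as M-q produced in the message above.
        (wrapped
         (read "\"\\N{GREEK CAPITAL LETTER ALPHA}\\N{EN\nDASH}\\N{GREEK CAPITAL LETTER OMEGA}\"")))
    ;; SAME is non-nil when the wrapped form reads as the same string.
    (cons plain (equal plain wrapped))))
```

If the reader rejected white space inside the braces, the second `read` would signal an error instead of returning a string.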
* Re: Character literals for Unicode (control) characters
From: Clément Pit--Claudel @ 2016-03-05 17:53 UTC
To: Paul Eggert, emacs-devel

On 03/05/2016 12:12 PM, Paul Eggert wrote:
> I don't see a problem here. The original string should look something like this:
>
> "Use Greek capital letters (\N{GREEK CAPITAL LETTER ALPHA}\N{EN DASH}\N{GREEK CAPITAL LETTER OMEGA}) to denote figures."
>
> and there's no space between the "DASH}" and the following "\N{GREEK" for M-q to latch onto. I just now tried M-q on the above string and it came up with:
>
> (defun foo (abc)
> "Use Greek capital letters (\N{GREEK CAPITAL LETTER ALPHA}\N{EN
> DASH}\N{GREEK CAPITAL LETTER OMEGA}) to denote figures."
> ...)
>
> which should work OK if arbitrary white space is allowed between words inside \N{...} escapes.

Sorry, maybe I wasn't clear. My point was about the fact that since the escapes and the actual characters don't have the same length, and since printing a docstring doesn't rewrap it, docstrings wrapped with M-q in the source will look wrong after rendering.

Given this (wrapped with M-q):

(defun aaa ()
  "AAAA. \N{ARROW POINTING DOWNWARDS THEN CURVING LEFTWARDS} is not the same as \N{ARROW POINTING RIGHTWARDS THEN CURVING DOWNWARDS}.")

The rendering will be

aaa is a Lisp function.

(aaa)

AAAA. ⤶ is not the same as ⤵.
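The length mismatch described here is easy to quantify. A rough sketch, assuming an Emacs whose reader accepts \N{...}; the helper name `sketch-escape-vs-char-length` is invented for this illustration.

```elisp
(defun sketch-escape-vs-char-length (name)
  "Return (ESCAPE-LENGTH . CHAR-LENGTH) for the Unicode character NAME."
  (let* ((escape (format "\\N{%s}" name))            ; source text of the escape
         (rendered (read (concat "\"" escape "\"")))) ; string the reader produces
    (cons (length escape) (length rendered))))

(sketch-escape-vs-char-length
 "ARROW POINTING DOWNWARDS THEN CURVING LEFTWARDS")
;; => (51 . 1)
```

So a line that M-q fills to the margin in the source shrinks by 50 columns per such escape once rendered, which is why the source layout says little about the displayed one.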
* Re: Character literals for Unicode (control) characters
From: Eli Zaretskii @ 2016-03-05 18:16 UTC
To: Clément Pit--Claudel; +Cc: eggert, emacs-devel

> From: Clément Pit--Claudel <clement.pit@gmail.com>
> Date: Sat, 5 Mar 2016 12:53:10 -0500
>
> Sorry, maybe I wasn't clear. My point was about the fact that since the escapes and the actual characters don't have the same length, and since printing a docstring doesn't rewrap it, docstrings wrapped with M-q in the source will look wrong after rendering.

Doc strings should never be wrapped with the likes of M-q. For starters, this can make the first line include more than one sentence. More generally, there are already constructs we recognize in doc strings that produce longer or shorter strings when displayed, so M-q is just not up to the job, and shouldn't be used.
* Re: Character literals for Unicode (control) characters
From: Clément Pit--Claudel @ 2016-03-05 18:34 UTC
To: Eli Zaretskii; +Cc: eggert, emacs-devel

On 03/05/2016 01:16 PM, Eli Zaretskii wrote:
>> From: Clément Pit--Claudel <clement.pit@gmail.com>
>>
>> Sorry, maybe I wasn't clear. My point was about the fact that since
>> the escapes and the actual characters don't have the same length,
>> and since printing a docstring doesn't rewrap it, docstrings
>> wrapped with M-q in the source will look wrong after rendering.
>
> Doc strings should never be wrapped with the likes of M-q.

I see, thanks.

> For starters, this can make the first line include more than one
> sentence.

Can it? Don't we have special code in `lisp-fill-paragraph' to avoid this?

> More generally, there are already constructs we recognize
> in doc strings that produce longer or shorter strings when
> displayed, so M-q is just not up to the job, and shouldn't be used.

Thanks for clarifying. Do we have something else?
* Re: Character literals for Unicode (control) characters
From: Eli Zaretskii @ 2016-03-05 18:56 UTC
To: Clément Pit--Claudel; +Cc: eggert, emacs-devel

> Cc: eggert@cs.ucla.edu, emacs-devel@gnu.org
> From: Clément Pit--Claudel <clement.pit@gmail.com>
> Date: Sat, 5 Mar 2016 13:34:46 -0500
>
> > For starters, this can make the first line include more than one
> > sentence.
>
> Can it? Don't we have special code in `lisp-fill-paragraph' to avoid this?

Yes, if you do that in the right mode, and if there are no backslashes in the string before the newline etc. I wouldn't rely on this 100%.

> > More generally, there are already constructs we recognize
> > in doc strings that produce longer or shorter strings when
> > displayed, so M-q is just not up to the job, and shouldn't be used.
>
> Thanks for clarifying. Do we have something else?

I simply test the results, then adjust the lines as needed.
* RE: Character literals for Unicode (control) characters
From: Drew Adams @ 2016-03-05 19:08 UTC
To: Eli Zaretskii, Clément Pit--Claudel; +Cc: eggert, emacs-devel

> > Do we have something else?
>
> I simply test the results, then adjust the lines as needed.

+1. One should always check doc strings after rendering, in *Help*, at least those that use constructs such as \\[...], \\{...}, and \\<...>, and variable-width chars such as tab. Do not rely on what they look like in Lisp source code.
* Re: Character literals for Unicode (control) characters
From: Clément Pit--Claudel @ 2016-03-05 22:52 UTC
To: Drew Adams, Eli Zaretskii; +Cc: eggert, emacs-devel

On 03/05/2016 02:08 PM, Drew Adams wrote:
>>> Do we have something else?
>>
>> I simply test the results, then adjust the lines as needed.
>
> +1. One should always check doc strings after rendering, in
> *Help*, at least those that use constructs such as \\[...],
> \\{...}, and \\<...>, and variable-width chars such as tab.
> Do not rely on what they look like in Lisp source code.

Thanks Eli and Drew!
* Re: Character literals for Unicode (control) characters
From: Joost Kremers @ 2016-03-06 15:49 UTC
To: Eli Zaretskii; +Cc: eggert, Clément Pit--Claudel, emacs-devel

On Sat, Mar 05 2016, Eli Zaretskii wrote:
> Doc strings should never be wrapped with the likes of M-q. For
> starters, this can make the first line include more than one
> sentence.

IME, if you hit M-q in a doc string, the first line isn't changed. Perhaps that only works when point is not on the first line, though.

> More generally, there are already constructs we recognize
> in doc strings that produce longer or shorter strings when displayed,
> so M-q is just not up to the job, and shouldn't be used.

But there's no real alternative, is there? IOW, you might as well use M-q, because if you use something like \\[my-function], you never know if that'll display as a short key binding, a long key binding, or as `M-x my-function', which can actually be very long.

(I guess the best way would be to custom-wrap doc strings before displaying them, after constructs such as \\[...] have been resolved.)

-- 
Joost Kremers
Life has its moments
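The idea in the last paragraph (wrap only after the constructs have been resolved) could look roughly like the sketch below. It is not how *Help* works; `sketch-refilled-doc` is a hypothetical helper, and plain `fill-region` will also merge a one-sentence first line into the following text and disturb any tabular layout, so treat it as an experiment only.

```elisp
(defun sketch-refilled-doc (function &optional width)
  "Return FUNCTION's doc string, substituted and then filled to WIDTH columns."
  (with-temp-buffer
    ;; `documentation' expands \\[...] and friends unless its RAW
    ;; argument is non-nil, so the text inserted here is already rendered.
    (insert (or (documentation function) ""))
    (let ((fill-column (or width 70)))
      (fill-region (point-min) (point-max)))
    (buffer-string)))
```

For example, `(sketch-refilled-doc 'forward-char 40)` returns that command's documentation refilled to 40 columns.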
* RE: Character literals for Unicode (control) characters
From: Drew Adams @ 2016-03-06 16:55 UTC
To: Joost Kremers, Eli Zaretskii; +Cc: eggert, Clément Pit--Claudel, emacs-devel

> > so M-q is just not up to the job, and shouldn't be used.
>
> But there's no real alternative, is there? IOW, you might as well
> use M-q, because if you use something like \\[my-function], you
> never know if that'll display as a short key binding, a long key
> binding, or as `M-x my-function', which can actually be very long.

You never know for sure. But usually you know what to _expect_. Often a doc string with \\[...] is for a command that is used in a given context, where you can have a good idea whether the command is likely to be bound to a key or not. But of course, it could be bound to another key than the default one, and the bound key could have a much longer description because of prefix keys.

So no, there is no silver bullet here. You just need to use good, common sense, aiming for the most typical, expected, default use case/context.

The reason for conventional doc strings, including a maximum line length, is for user convenience. This includes simply reading but also things like window and frame fitting to the buffer content (line lengths and number of lines).

As with all attempts to help users at coding time, we can only either (1) try to address the most common expected use cases or (2) provide users a way to customize the behavior. For doc strings, this comes down mainly to #1.

> (I guess the best way would be to custom-wrap doc strings before
> displaying them, after constructs such as \\[...] have been
> resolved.)

It's not just about wrapping. You need to look at rendered doc strings to see what the effect is on embedded TAB chars also, e.g., to try to align text that is essentially tabular, such as key descriptions.

It's really not a big deal to use `C-M-x' to reevaluate, and then use `C-h f' to see what the doc string looks like when rendered. And yes, you might need to first put the current buffer in the right mode, so it picks up the right keymaps. You might even need to set the font so that it displays each of the chars in the doc string correctly.
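As a complement to eyeballing `C-h f', the rendered line lengths can also be measured. A small sketch, with the made-up name `sketch-doc-max-width`; it depends on the current buffer's keymaps just as the advice above says, and `string-width` counts a TAB as a fixed width rather than aligning it to a column, so tabular doc strings still need a visual check.

```elisp
(defun sketch-doc-max-width (function)
  "Return the width of the longest line in FUNCTION's rendered doc string."
  ;; `documentation' substitutes \\[...], \\{...} and \\<...> for us.
  (let ((doc (or (documentation function) "")))
    (apply #'max 0 (mapcar #'string-width (split-string doc "\n")))))
```

Evaluating `(sketch-doc-max-width 'my-command)` after `C-M-x' on a hypothetical `my-command` gives a quick hint whether any rendered line runs past the conventional limit.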