Re: Character literals for Unicode (control) characters

unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed

From: Paul Eggert <eggert@cs.ucla.edu>
To: Philipp Stephani <p.stephani2@gmail.com>
Cc: emacs-devel@gnu.org
Subject: Re: Character literals for Unicode (control) characters
Date: Thu, 21 Apr 2016 19:39:50 -0700	[thread overview]
Message-ID: <57198EF6.50001@cs.ucla.edu> (raw)
In-Reply-To: <CAArVCkQ26+TN9BNv3ApPBfs=vsUycBt+rmd9ospLPVLWeEDK6Q@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 371 bytes --]

Thanks for doing all that. I installed your patches into the Emacs 
master, along with the attached further patch which omits the 
undocumented support for escapes like "\N{CJK IDEOGRAPH-3400}" as I 
couldn't see the utility of these over and above plain "\N{U+3400}", 
plus it wasn't clear why CJK ideographs needed special-case names 
whereas other ideographs did not.

[-- Attachment #2: 0001-Improve-character-name-escapes.txt --]
[-- Type: text/plain, Size: 15431 bytes --]

From bd1c7ca67e7429e07f78d4ff49163fd7a67a6765 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Thu, 21 Apr 2016 19:26:34 -0700
Subject: [PATCH] Improve character name escapes

* doc/lispref/nonascii.texi (Character Properties):
Avoid duplication of Unicode names.  Reformat examples to fit in
narrow pages.
* doc/lispref/objects.texi (General Escape Syntax):
Simplify and better-organize explanation of \N{...} escapes.
* src/character.h (CHAR_SURROGATE_PAIR_P): Remove; unused.
(char_surrogate_p): New inline function.
* src/lread.c: Do not include string.h; no longer needed.
(invalid_character_name, check_scalar_value): Remove; the ideas
behind these functions are now bundled into character_name_to_code.
(character_name_to_code): Remove undocumented support for "CJK
IDEOGRAPH-XXXX" names, as "U+XXXX" suffices.  Reject monstrosities
like "\N{U+-0}" and null bytes in \N escapes.  Reject floating
point in \N escapes instead of returning garbage.  Use
AUTO_STRING_WITH_LEN to lessen pressure on the garbage collector.
* test/src/lread-tests.el (lread-char-number, lread-char-name)
(lread-string-char-number, lread-string-char-name):
Test runtime behavior, not compile-time, as the test framework
is not set up to test compile-time.
(lread-char-surrogate-1, lread-char-surrogate-2)
(lread-char-surrogate-3, lread-char-surrogate-4)
(lread-string-char-number-2, lread-string-char-number-3):
New tests.
(lread-string-char-number-1): Rename from lread-string-char-number.
---
 doc/lispref/nonascii.texi |  15 ++++---
 doc/lispref/objects.texi  |  52 +++++++++++------------
 src/character.h           |  13 +++---
 src/lread.c               | 104 +++++++++++++---------------------------------
 test/src/lread-tests.el   |  32 ++++++++------
 5 files changed, 89 insertions(+), 127 deletions(-)

diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index 66ad9ac..0e4aa86 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
@@ -622,18 +622,21 @@ Character Properties
      @result{} Nd
 @end group
 @group
-;; U+2084 SUBSCRIPT FOUR
-(get-char-code-property ?\u2084 'digit-value)
+;; U+2084
+(get-char-code-property ?\N@{SUBSCRIPT FOUR@}
+                        'digit-value)
      @result{} 4
 @end group
 @group
-;; U+2155 VULGAR FRACTION ONE FIFTH
-(get-char-code-property ?\u2155 'numeric-value)
+;; U+2155
+(get-char-code-property ?\N@{VULGAR FRACTION ONE FIFTH@}
+                        'numeric-value)
      @result{} 0.2
 @end group
 @group
-;; U+2163 ROMAN NUMERAL FOUR
-(get-char-code-property ?\N@{ROMAN NUMERAL FOUR@} 'numeric-value)
+;; U+2163
+(get-char-code-property ?\N@{ROMAN NUMERAL FOUR@}
+                        'numeric-value)
      @result{} 4
 @end group
 @group
diff --git a/doc/lispref/objects.texi b/doc/lispref/objects.texi
index 96b334d..54894b8 100644
--- a/doc/lispref/objects.texi
+++ b/doc/lispref/objects.texi
@@ -353,25 +353,32 @@ General Escape Syntax
 control characters, Emacs provides several types of escape syntax that
 you can use to specify non-@acronym{ASCII} text characters.
 
+@enumerate
+@item
 @cindex @samp{\} in character constant
 @cindex backslash in character constants
 @cindex unicode character escape
-  Firstly, you can specify characters by their Unicode values.
-@code{?\u@var{nnnn}} represents a character with Unicode code point
-@samp{U+@var{nnnn}}, where @var{nnnn} is (by convention) a hexadecimal
-number with exactly four digits.  The backslash indicates that the
-subsequent characters form an escape sequence, and the @samp{u}
-specifies a Unicode escape sequence.
-
-  There is a slightly different syntax for specifying Unicode
-characters with code points higher than @code{U+@var{ffff}}:
-@code{?\U00@var{nnnnnn}} represents the character with code point
-@samp{U+@var{nnnnnn}}, where @var{nnnnnn} is a six-digit hexadecimal
-number.  The Unicode Standard only defines code points up to
-@samp{U+@var{10ffff}}, so if you specify a code point higher than
-that, Emacs signals an error.
-
-  Secondly, you can specify characters by their hexadecimal character
+You can specify characters by their Unicode names, if any.
+@code{?\N@{@var{NAME}@}} represents the Unicode character named
+@var{NAME}.  Thus, @samp{?\N@{LATIN SMALL LETTER A WITH GRAVE@}} is
+equivalent to @code{?à} and denotes the Unicode character U+00E0.  To
+simplify entering multi-line strings, you can replace spaces in the
+names by non-empty sequences of whitespace (e.g., newlines).
+
+@item
+You can specify characters by their Unicode values.
+@code{?\N@{U+@var{X}@}} represents a character with Unicode code point
+@var{X}, where @var{X} is a hexadecimal number.  Also,
+@code{?\u@var{xxxx}} and @code{?\U@var{xxxxxxxx}} represent code
+points @var{xxxx} and @var{xxxxxxxx}, respectively, where each @var{x}
+is a single hexadecimal digit.  For example, @code{?\N@{U+E0@}},
+@code{?\u00e0} and @code{?\U000000E0} are all equivalent to @code{?à}
+and to @samp{?\N@{LATIN SMALL LETTER A WITH GRAVE@}}.  The Unicode
+Standard defines code points only up to @samp{U+@var{10ffff}}, so if
+you specify a code point higher than that, Emacs signals an error.
+
+@item
+You can specify characters by their hexadecimal character
 codes.  A hexadecimal escape sequence consists of a backslash,
 @samp{x}, and the hexadecimal character code.  Thus, @samp{?\x41} is
 the character @kbd{A}, @samp{?\x1} is the character @kbd{C-a}, and
@@ -379,23 +386,16 @@ General Escape Syntax
 You can use any number of hex digits, so you can represent any
 character code in this way.
 
+@item
 @cindex octal character code
-  Thirdly, you can specify characters by their character code in
+You can specify characters by their character code in
 octal.  An octal escape sequence consists of a backslash followed by
 up to three octal digits; thus, @samp{?\101} for the character
 @kbd{A}, @samp{?\001} for the character @kbd{C-a}, and @code{?\002}
 for the character @kbd{C-b}.  Only characters up to octal code 777 can
 be specified this way.
 
-  Fourthly, you can specify characters by their name.  A character
-name escape sequence consists of a backslash, @samp{N@{}, the Unicode
-character name, and @samp{@}}.  Alternatively, you can also put the
-numeric code point value between the braces, using the syntax
-@samp{\N@{U+nnnn@}}, where @samp{nnnn} denotes between one and eight
-hexadecimal digits.  Thus, @samp{?\N@{LATIN CAPITAL LETTER A@}} and
-@samp{?\N@{U+41@}} both denote the character @kbd{A}.  To simplify
-entering multi-line strings, you can replace spaces in the character
-names by arbitrary non-empty sequence of whitespace (e.g., newlines).
+@end enumerate
 
   These escape sequences may also be used in strings.  @xref{Non-ASCII
 in Strings}.
diff --git a/src/character.h b/src/character.h
index bc3e155..586f330 100644
--- a/src/character.h
+++ b/src/character.h
@@ -612,14 +612,13 @@ sanitize_char_width (EMACS_INT width)
    : (c) <= 0xE01EF ? (c) - 0xE0100 + 17	\
    : 0)
 
-/* If C is a high surrogate, return 1.  If C is a low surrogate,
-   return 2.  Otherwise, return 0.  */
+/* Return true if C is a surrogate.  */
 
-#define CHAR_SURROGATE_PAIR_P(c)	\
-  ((c) < 0xD800 ? 0			\
-   : (c) <= 0xDBFF ? 1			\
-   : (c) <= 0xDFFF ? 2			\
-   : 0)
+INLINE bool
+char_surrogate_p (int c)
+{
+  return 0xD800 <= c && c <= 0xDFFF;
+}
 
 /* Data type for Unicode general category.
 
diff --git a/src/lread.c b/src/lread.c
index c3b6bd7..a42c1f6 100644
--- a/src/lread.c
+++ b/src/lread.c
@@ -44,7 +44,6 @@ along with GNU Emacs.  If not, see <http://www.gnu.org/licenses/>.  */
 #include "termhooks.h"
 #include "blockinput.h"
 #include <c-ctype.h>
-#include <string.h>
 
 #ifdef MSDOS
 #include "msdos.h"
@@ -2151,88 +2150,42 @@ grow_read_buffer (void)
 			 MAX_MULTIBYTE_LENGTH, -1, 1);
 }
 
-/* Signal an invalid-read-syntax error indicating that the character
-   name in an \N{…} literal is invalid.  */
-static _Noreturn void
-invalid_character_name (Lisp_Object name)
-{
-  AUTO_STRING (format, "\\N{%s}");
-  xsignal1 (Qinvalid_read_syntax, CALLN (Fformat, format, name));
-}
-
-/* Check that CODE is a valid Unicode scalar value, and return its
-   value.  CODE should be parsed from the character name given by
-   NAME.  NAME is used for error messages.  */
+/* Return the scalar value that has the Unicode character name NAME.
+   Raise 'invalid-read-syntax' if there is no such character.  */
 static int
-check_scalar_value (Lisp_Object code, Lisp_Object name)
+character_name_to_code (char const *name, ptrdiff_t name_len)
 {
-  if (! NUMBERP (code))
-    invalid_character_name (name);
-  EMACS_INT i = XINT (code);
-  if (! (0 <= i && i <= MAX_UNICODE_CHAR)
-      /* Don't allow surrogates.  */
-      || (0xD800 <= code && code <= 0xDFFF))
-    invalid_character_name (name);
-  return i;
-}
+  Lisp_Object code;
 
-/* If NAME starts with PREFIX, interpret the rest as a hexadecimal
-   number and return its value.  Raise invalid-read-syntax if the
-   number is not a valid scalar value.  Return −1 if NAME doesn’t
-   start with PREFIX.  */
-static int
-parse_code_after_prefix (Lisp_Object name, const char *prefix)
-{
-  ptrdiff_t name_len = SBYTES (name);
-  ptrdiff_t prefix_len = strlen (prefix);
-  /* Allow between one and eight hexadecimal digits after the
-     prefix.  */
-  if (prefix_len < name_len && name_len <= prefix_len + 8
-      && memcmp (SDATA (name), prefix, prefix_len) == 0)
+  /* Code point as U+XXXX....  */
+  if (name[0] == 'U' && name[1] == '+')
     {
-      Lisp_Object code = string_to_number (SDATA (name) + prefix_len, 16, false);
-      if (NUMBERP (code))
-        return check_scalar_value (code, name);
+      /* Pass the leading '+' to string_to_number, so that it
+	 rejects monstrosities such as negative values.  */
+      code = string_to_number (name + 1, 16, false);
+    }
+  else
+    {
+      /* Look up the name in the table returned by 'ucs-names'.  */
+      AUTO_STRING_WITH_LEN (namestr, name, name_len);
+      Lisp_Object names = call0 (Qucs_names);
+      code = CDR (Fassoc (namestr, names));
     }
-  return -1;
-}
 
-/* Returns the scalar value that has the Unicode character name NAME.
-   Raises `invalid-read-syntax' if there is no such character.  */
-static int
-character_name_to_code (Lisp_Object name)
-{
-  /* Code point as U+N, where N is between 1 and 8 hexadecimal
-     digits.  */
-  int code = parse_code_after_prefix (name, "U+");
-  if (code >= 0)
-    return code;
-
-  /* CJK ideographs are not contained in the association list returned
-     by `ucs-names'.  But they follow a predictable naming pattern: a
-     fixed prefix plus the hexadecimal codepoint value.  */
-  code = parse_code_after_prefix (name, "CJK IDEOGRAPH-");
-  if (code >= 0)
+  if (! (INTEGERP (code)
+	 && 0 <= XINT (code) && XINT (code) <= MAX_UNICODE_CHAR
+	 && ! char_surrogate_p (XINT (code))))
     {
-      /* Various ranges of CJK characters; see UnicodeData.txt.  */
-      if ((0x3400 <= code && code <= 0x4DB5)
-          || (0x4E00 <= code && code <= 0x9FD5)
-          || (0x20000 <= code && code <= 0x2A6D6)
-          || (0x2A700 <= code && code <= 0x2B734)
-          || (0x2B740 <= code && code <= 0x2B81D)
-          || (0x2B820 <= code && code <= 0x2CEA1))
-        return code;
-      else
-        invalid_character_name (name);
+      AUTO_STRING (format, "\\N{%s}");
+      AUTO_STRING_WITH_LEN (namestr, name, name_len);
+      xsignal1 (Qinvalid_read_syntax, CALLN (Fformat, format, namestr));
     }
 
-  /* Look up the name in the table returned by `ucs-names'.  */
-  Lisp_Object names = call0 (Qucs_names);
-  return check_scalar_value (CDR (Fassoc (name, names)), name);
+  return XINT (code);
 }
 
 /* Bound on the length of a Unicode character name.  As of
-   Unicode 9.0.0 the maximum is 83, so this should be safe. */
+   Unicode 9.0.0 the maximum is 83, so this should be safe.  */
 enum { UNICODE_CHARACTER_NAME_LENGTH_BOUND = 200 };
 
 /* Read a \-escape sequence, assuming we already read the `\'.
@@ -2458,14 +2411,14 @@ read_escape (Lisp_Object readcharfun, bool stringp)
               end_of_file_error ();
             if (c == '}')
               break;
-            if (! c_isascii (c))
+            if (! (0 < c && c < 0x80))
               {
                 AUTO_STRING (format,
-                             "Non-ASCII character U+%04X in character name");
+                             "Invalid character U+%04X in character name");
                 xsignal1 (Qinvalid_read_syntax,
                           CALLN (Fformat, format, make_natnum (c)));
               }
-            /* We treat multiple adjacent whitespace characters as a
+            /* Treat multiple adjacent whitespace characters as a
                single space character.  This makes it easier to use
                character names in e.g. multi-line strings.  */
             if (c_isspace (c))
@@ -2483,7 +2436,8 @@ read_escape (Lisp_Object readcharfun, bool stringp)
           }
         if (length == 0)
           invalid_syntax ("Empty character name");
-        return character_name_to_code (make_unibyte_string (name, length));
+	name[length] = '\0';
+	return character_name_to_code (name, length);
       }
 
     default:
diff --git a/test/src/lread-tests.el b/test/src/lread-tests.el
index ff5d0f6..2ebaf49 100644
--- a/test/src/lread-tests.el
+++ b/test/src/lread-tests.el
@@ -1,6 +1,6 @@
 ;;; lread-tests.el --- tests for lread.c -*- lexical-binding: t; -*-
 
-;; Copyright (C) 2016  Google Inc.
+;; Copyright (C) 2016 Free Software Foundation, Inc.
 
 ;; Author: Philipp Stephani <phst@google.com>
 
@@ -26,11 +26,10 @@
 ;;; Code:
 
 (ert-deftest lread-char-number ()
-  (should (equal ?\N{U+A817} #xA817)))
+  (should (equal (read "?\\N{U+A817}") #xA817)))
 
 (ert-deftest lread-char-name ()
-  (should (equal ?\N{SYLOTI  NAGRI LETTER
-                 DHO}
+  (should (equal (read "?\\N{SYLOTI  NAGRI LETTER \n DHO}")
                  #xA817)))
 
 (ert-deftest lread-char-invalid-number ()
@@ -46,16 +45,23 @@
 (ert-deftest lread-char-empty-name ()
   (should-error (read "?\\N{}") :type 'invalid-read-syntax))
 
-(ert-deftest lread-char-cjk-name ()
-  (should (equal ?\N{CJK IDEOGRAPH-2B734} #x2B734)))
-
-(ert-deftest lread-char-invalid-cjk-name ()
-  (should-error (read "?\\N{CJK IDEOGRAPH-2B735}") :type 'invalid-read-syntax))
-
-(ert-deftest lread-string-char-number ()
-  (should (equal "a\N{U+A817}b" "a\uA817b")))
+(ert-deftest lread-char-surrogate-1 ()
+  (should-error (read "?\\N{U+D800}") :type 'invalid-read-syntax))
+(ert-deftest lread-char-surrogate-2 ()
+  (should-error (read "?\\N{U+D801}") :type 'invalid-read-syntax))
+(ert-deftest lread-char-surrogate-3 ()
+  (should-error (read "?\\N{U+Dffe}") :type 'invalid-read-syntax))
+(ert-deftest lread-char-surrogate-4 ()
+  (should-error (read "?\\N{U+DFFF}") :type 'invalid-read-syntax))
+
+(ert-deftest lread-string-char-number-1 ()
+  (should (equal (read "a\\N{U+A817}b") "a\uA817bx")))
+(ert-deftest lread-string-char-number-2 ()
+  (should-error (read "?\\N{0.5}") :type 'invalid-read-syntax))
+(ert-deftest lread-string-char-number-3 ()
+  (should-error (read "?\\N{U+-0}") :type 'invalid-read-syntax))
 
 (ert-deftest lread-string-char-name ()
-  (should (equal "a\N{SYLOTI NAGRI  LETTER DHO}b" "a\uA817b")))
+  (should (equal (read "a\\N{SYLOTI NAGRI  LETTER DHO}b") "a\uA817b")))
 
 ;;; lread-tests.el ends here
-- 
2.5.5

next prev parent reply	other threads:[~2016-04-22  2:39 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-03-03  5:47 Character literals for Unicode (control) characters Lars Ingebrigtsen
2016-03-03  6:20 ` John Wiegley
2016-03-03  6:25   ` Lars Ingebrigtsen
2016-03-03  6:34 ` Drew Adams
2016-03-03 16:11 ` Paul Eggert
2016-03-03 20:48   ` Eli Zaretskii
2016-03-03 23:58     ` Paul Eggert
2016-03-05 15:28   ` Philipp Stephani
2016-03-05 15:39     ` Marcin Borkowski
2016-03-05 16:51       ` Philipp Stephani
2016-03-06  2:27     ` John Wiegley
2016-03-06 15:24       ` Philipp Stephani
2016-03-06 15:54         ` Eli Zaretskii
2016-03-06 17:35           ` Philipp Stephani
2016-03-06 18:08             ` Paul Eggert
2016-03-06 18:28               ` Philipp Stephani
2016-03-06 19:03                 ` Paul Eggert
2016-03-06 19:16                   ` Philipp Stephani
2016-03-06 20:05                     ` Eli Zaretskii
2016-03-13 20:31                       ` Philipp Stephani
2016-03-14 20:03                         ` Paul Eggert
2016-03-14 20:30                           ` Eli Zaretskii
2016-03-15 11:09                             ` Nikolai Weibull
2016-03-15 17:10                               ` Eli Zaretskii
2016-03-16  8:16                                 ` Nikolai Weibull
2016-03-14 21:27                           ` Clément Pit--Claudel
2016-03-14 21:48                             ` Paul Eggert
2016-03-19 16:27                           ` Philipp Stephani
2016-03-20 12:58                             ` Paul Eggert
2016-03-20 13:25                               ` Philipp Stephani
2016-03-25 17:41                                 ` Philipp Stephani
2016-04-22  2:39                                   ` Paul Eggert [this message]
2016-04-22  7:57                                     ` Eli Zaretskii
2016-04-22  8:01                                       ` Eli Zaretskii
2016-04-22  9:39                                         ` Elias Mårtenson
2016-04-22 10:01                                           ` Eli Zaretskii
2016-04-25 17:48                                             ` Paul Eggert
2016-03-05 16:35   ` Clément Pit--Claudel
2016-03-05 17:12     ` Paul Eggert
2016-03-05 17:53       ` Clément Pit--Claudel
2016-03-05 18:16         ` Eli Zaretskii
2016-03-05 18:34           ` Clément Pit--Claudel
2016-03-05 18:56             ` Eli Zaretskii
2016-03-05 19:08               ` Drew Adams
2016-03-05 22:52                 ` Clément Pit--Claudel
2016-03-06 15:49           ` Joost Kremers
2016-03-06 16:55             ` Drew Adams

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:66ad9ac dfblob:0e4aa86 dfblob:96b334d dfblob:54894b8
dfblob:bc3e155 dfblob:586f330 dfblob:c3b6bd7 dfblob:a42c1f6
dfblob:ff5d0f6 dfblob:2ebaf49 )
 OR (
bs:"Improve character name escapes" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=57198EF6.50001@cs.ucla.edu \
    --to=eggert@cs.ucla.edu \
    --cc=emacs-devel@gnu.org \
    --cc=p.stephani2@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).