From: Paul Eggert <eggert@cs.ucla.edu>
To: Philipp Stephani <p.stephani2@gmail.com>
Cc: emacs-devel@gnu.org
Subject: Re: Character literals for Unicode (control) characters
Date: Thu, 21 Apr 2016 19:39:50 -0700
Message-ID: <57198EF6.50001@cs.ucla.edu>
In-Reply-To: <CAArVCkQ26+TN9BNv3ApPBfs=vsUycBt+rmd9ospLPVLWeEDK6Q@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 371 bytes --]
Thanks for doing all that. I installed your patches into Emacs
master, along with the attached further patch. It omits the
undocumented support for escapes like "\N{CJK IDEOGRAPH-3400}": I
couldn't see their utility over and above plain "\N{U+3400}", and
it wasn't clear why CJK ideographs needed special-case names
whereas other ideographs did not.
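
For readers who want the "\N{U+XXXX}" acceptance rule in isolation, here is a minimal standalone sketch. It is not the code in the attached patch: the patch hands the text after the 'U' to Emacs's own string_to_number and then checks the result, whereas this sketch substitutes the C library's strtol plus an explicit hex-digit scan. The names parse_u_plus, is_surrogate and MAX_UNICODE are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Highest code point defined by the Unicode Standard.  */
enum { MAX_UNICODE = 0x10FFFF };

/* Return true if C is a surrogate; same range test as
   char_surrogate_p in the patch.  */
static bool
is_surrogate (long c)
{
  return 0xD800 <= c && c <= 0xDFFF;
}

/* Hypothetical helper, not part of Emacs.  If NAME has the form
   "U+X..." with one or more hex digits naming a Unicode scalar
   value, return that value.  Otherwise return -1.  */
static long
parse_u_plus (const char *name)
{
  assert (name != NULL);
  if (name[0] != 'U' || name[1] != '+')
    return -1;

  const char *digits = name + 2;
  if (*digits == '\0')
    return -1;

  /* Accept hex digits only.  This rejects signs ("U+-0"),
     fractions ("U+0.5"), whitespace and "0x" prefixes, all of
     which strtol alone would tolerate or silently truncate.  */
  for (const char *p = digits; *p != '\0'; p++)
    if (!isxdigit ((unsigned char) *p))
      return -1;

  errno = 0;
  long code = strtol (digits, NULL, 16);
  if (errno == ERANGE || code > MAX_UNICODE || is_surrogate (code))
    return -1;
  return code;
}
```

With this sketch, "U+3400" and "U+E0" yield 0x3400 and 0xE0, while "U+-0", "U+D800", "U+110000" and a name such as "CJK IDEOGRAPH-3400" are all rejected. A real reader would signal invalid-read-syntax rather than return -1.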
[-- Attachment #2: 0001-Improve-character-name-escapes.txt --]
[-- Type: text/plain, Size: 15431 bytes --]
From bd1c7ca67e7429e07f78d4ff49163fd7a67a6765 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Thu, 21 Apr 2016 19:26:34 -0700
Subject: [PATCH] Improve character name escapes
* doc/lispref/nonascii.texi (Character Properties):
Avoid duplication of Unicode names. Reformat examples to fit in
narrow pages.
* doc/lispref/objects.texi (General Escape Syntax):
Simplify and better-organize explanation of \N{...} escapes.
* src/character.h (CHAR_SURROGATE_PAIR_P): Remove; unused.
(char_surrogate_p): New inline function.
* src/lread.c: Do not include string.h; no longer needed.
(invalid_character_name, check_scalar_value): Remove; the ideas
behind these functions are now bundled into character_name_to_code.
(character_name_to_code): Remove undocumented support for "CJK
IDEOGRAPH-XXXX" names, as "U+XXXX" suffices. Reject monstrosities
like "\N{U+-0}" and null bytes in \N escapes. Reject floating
point in \N escapes instead of returning garbage. Use
AUTO_STRING_WITH_LEN to lessen pressure on the garbage collector.
* test/src/lread-tests.el (lread-char-number, lread-char-name)
(lread-string-char-number, lread-string-char-name):
Test runtime behavior, not compile-time, as the test framework
is not set up to test compile-time.
(lread-char-surrogate-1, lread-char-surrogate-2)
(lread-char-surrogate-3, lread-char-surrogate-4)
(lread-string-char-number-2, lread-string-char-number-3):
New tests.
(lread-string-char-number-1): Rename from lread-string-char-number.
---
doc/lispref/nonascii.texi | 15 ++++---
doc/lispref/objects.texi | 52 +++++++++++------------
src/character.h | 13 +++---
src/lread.c | 104 +++++++++++++---------------------------------
test/src/lread-tests.el | 32 ++++++++------
5 files changed, 89 insertions(+), 127 deletions(-)
diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index 66ad9ac..0e4aa86 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
@@ -622,18 +622,21 @@ Character Properties
@result{} Nd
@end group
@group
-;; U+2084 SUBSCRIPT FOUR
-(get-char-code-property ?\u2084 'digit-value)
+;; U+2084
+(get-char-code-property ?\N@{SUBSCRIPT FOUR@}
+ 'digit-value)
@result{} 4
@end group
@group
-;; U+2155 VULGAR FRACTION ONE FIFTH
-(get-char-code-property ?\u2155 'numeric-value)
+;; U+2155
+(get-char-code-property ?\N@{VULGAR FRACTION ONE FIFTH@}
+ 'numeric-value)
@result{} 0.2
@end group
@group
-;; U+2163 ROMAN NUMERAL FOUR
-(get-char-code-property ?\N@{ROMAN NUMERAL FOUR@} 'numeric-value)
+;; U+2163
+(get-char-code-property ?\N@{ROMAN NUMERAL FOUR@}
+ 'numeric-value)
@result{} 4
@end group
@group
diff --git a/doc/lispref/objects.texi b/doc/lispref/objects.texi
index 96b334d..54894b8 100644
--- a/doc/lispref/objects.texi
+++ b/doc/lispref/objects.texi
@@ -353,25 +353,32 @@ General Escape Syntax
control characters, Emacs provides several types of escape syntax that
you can use to specify non-@acronym{ASCII} text characters.
+@enumerate
+@item
@cindex @samp{\} in character constant
@cindex backslash in character constants
@cindex unicode character escape
- Firstly, you can specify characters by their Unicode values.
-@code{?\u@var{nnnn}} represents a character with Unicode code point
-@samp{U+@var{nnnn}}, where @var{nnnn} is (by convention) a hexadecimal
-number with exactly four digits. The backslash indicates that the
-subsequent characters form an escape sequence, and the @samp{u}
-specifies a Unicode escape sequence.
-
- There is a slightly different syntax for specifying Unicode
-characters with code points higher than @code{U+@var{ffff}}:
-@code{?\U00@var{nnnnnn}} represents the character with code point
-@samp{U+@var{nnnnnn}}, where @var{nnnnnn} is a six-digit hexadecimal
-number. The Unicode Standard only defines code points up to
-@samp{U+@var{10ffff}}, so if you specify a code point higher than
-that, Emacs signals an error.
-
- Secondly, you can specify characters by their hexadecimal character
+You can specify characters by their Unicode names, if any.
+@code{?\N@{@var{NAME}@}} represents the Unicode character named
+@var{NAME}. Thus, @samp{?\N@{LATIN SMALL LETTER A WITH GRAVE@}} is
+equivalent to @code{?à} and denotes the Unicode character U+00E0. To
+simplify entering multi-line strings, you can replace spaces in the
+names by non-empty sequences of whitespace (e.g., newlines).
+
+@item
+You can specify characters by their Unicode values.
+@code{?\N@{U+@var{X}@}} represents a character with Unicode code point
+@var{X}, where @var{X} is a hexadecimal number. Also,
+@code{?\u@var{xxxx}} and @code{?\U@var{xxxxxxxx}} represent code
+points @var{xxxx} and @var{xxxxxxxx}, respectively, where each @var{x}
+is a single hexadecimal digit. For example, @code{?\N@{U+E0@}},
+@code{?\u00e0} and @code{?\U000000E0} are all equivalent to @code{?à}
+and to @samp{?\N@{LATIN SMALL LETTER A WITH GRAVE@}}. The Unicode
+Standard defines code points only up to @samp{U+@var{10ffff}}, so if
+you specify a code point higher than that, Emacs signals an error.
+
+@item
+You can specify characters by their hexadecimal character
codes. A hexadecimal escape sequence consists of a backslash,
@samp{x}, and the hexadecimal character code. Thus, @samp{?\x41} is
the character @kbd{A}, @samp{?\x1} is the character @kbd{C-a}, and
@@ -379,23 +386,16 @@ General Escape Syntax
You can use any number of hex digits, so you can represent any
character code in this way.
+@item
@cindex octal character code
- Thirdly, you can specify characters by their character code in
+You can specify characters by their character code in
octal. An octal escape sequence consists of a backslash followed by
up to three octal digits; thus, @samp{?\101} for the character
@kbd{A}, @samp{?\001} for the character @kbd{C-a}, and @code{?\002}
for the character @kbd{C-b}. Only characters up to octal code 777 can
be specified this way.
- Fourthly, you can specify characters by their name. A character
-name escape sequence consists of a backslash, @samp{N@{}, the Unicode
-character name, and @samp{@}}. Alternatively, you can also put the
-numeric code point value between the braces, using the syntax
-@samp{\N@{U+nnnn@}}, where @samp{nnnn} denotes between one and eight
-hexadecimal digits. Thus, @samp{?\N@{LATIN CAPITAL LETTER A@}} and
-@samp{?\N@{U+41@}} both denote the character @kbd{A}. To simplify
-entering multi-line strings, you can replace spaces in the character
-names by arbitrary non-empty sequence of whitespace (e.g., newlines).
+@end enumerate
These escape sequences may also be used in strings. @xref{Non-ASCII
in Strings}.
diff --git a/src/character.h b/src/character.h
index bc3e155..586f330 100644
--- a/src/character.h
+++ b/src/character.h
@@ -612,14 +612,13 @@ sanitize_char_width (EMACS_INT width)
: (c) <= 0xE01EF ? (c) - 0xE0100 + 17 \
: 0)
-/* If C is a high surrogate, return 1. If C is a low surrogate,
- return 2. Otherwise, return 0. */
+/* Return true if C is a surrogate. */
-#define CHAR_SURROGATE_PAIR_P(c) \
- ((c) < 0xD800 ? 0 \
- : (c) <= 0xDBFF ? 1 \
- : (c) <= 0xDFFF ? 2 \
- : 0)
+INLINE bool
+char_surrogate_p (int c)
+{
+ return 0xD800 <= c && c <= 0xDFFF;
+}
/* Data type for Unicode general category.
diff --git a/src/lread.c b/src/lread.c
index c3b6bd7..a42c1f6 100644
--- a/src/lread.c
+++ b/src/lread.c
@@ -44,7 +44,6 @@ along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>. */
#include "termhooks.h"
#include "blockinput.h"
#include <c-ctype.h>
-#include <string.h>
#ifdef MSDOS
#include "msdos.h"
@@ -2151,88 +2150,42 @@ grow_read_buffer (void)
MAX_MULTIBYTE_LENGTH, -1, 1);
}
-/* Signal an invalid-read-syntax error indicating that the character
- name in an \N{…} literal is invalid. */
-static _Noreturn void
-invalid_character_name (Lisp_Object name)
-{
- AUTO_STRING (format, "\\N{%s}");
- xsignal1 (Qinvalid_read_syntax, CALLN (Fformat, format, name));
-}
-
-/* Check that CODE is a valid Unicode scalar value, and return its
- value. CODE should be parsed from the character name given by
- NAME. NAME is used for error messages. */
+/* Return the scalar value that has the Unicode character name NAME.
+ Raise 'invalid-read-syntax' if there is no such character. */
static int
-check_scalar_value (Lisp_Object code, Lisp_Object name)
+character_name_to_code (char const *name, ptrdiff_t name_len)
{
- if (! NUMBERP (code))
- invalid_character_name (name);
- EMACS_INT i = XINT (code);
- if (! (0 <= i && i <= MAX_UNICODE_CHAR)
- /* Don't allow surrogates. */
- || (0xD800 <= code && code <= 0xDFFF))
- invalid_character_name (name);
- return i;
-}
+ Lisp_Object code;
-/* If NAME starts with PREFIX, interpret the rest as a hexadecimal
- number and return its value. Raise invalid-read-syntax if the
- number is not a valid scalar value. Return −1 if NAME doesn’t
- start with PREFIX. */
-static int
-parse_code_after_prefix (Lisp_Object name, const char *prefix)
-{
- ptrdiff_t name_len = SBYTES (name);
- ptrdiff_t prefix_len = strlen (prefix);
- /* Allow between one and eight hexadecimal digits after the
- prefix. */
- if (prefix_len < name_len && name_len <= prefix_len + 8
- && memcmp (SDATA (name), prefix, prefix_len) == 0)
+ /* Code point as U+XXXX.... */
+ if (name[0] == 'U' && name[1] == '+')
{
- Lisp_Object code = string_to_number (SDATA (name) + prefix_len, 16, false);
- if (NUMBERP (code))
- return check_scalar_value (code, name);
+ /* Pass the leading '+' to string_to_number, so that it
+ rejects monstrosities such as negative values. */
+ code = string_to_number (name + 1, 16, false);
+ }
+ else
+ {
+ /* Look up the name in the table returned by 'ucs-names'. */
+ AUTO_STRING_WITH_LEN (namestr, name, name_len);
+ Lisp_Object names = call0 (Qucs_names);
+ code = CDR (Fassoc (namestr, names));
}
- return -1;
-}
-/* Returns the scalar value that has the Unicode character name NAME.
- Raises `invalid-read-syntax' if there is no such character. */
-static int
-character_name_to_code (Lisp_Object name)
-{
- /* Code point as U+N, where N is between 1 and 8 hexadecimal
- digits. */
- int code = parse_code_after_prefix (name, "U+");
- if (code >= 0)
- return code;
-
- /* CJK ideographs are not contained in the association list returned
- by `ucs-names'. But they follow a predictable naming pattern: a
- fixed prefix plus the hexadecimal codepoint value. */
- code = parse_code_after_prefix (name, "CJK IDEOGRAPH-");
- if (code >= 0)
+ if (! (INTEGERP (code)
+ && 0 <= XINT (code) && XINT (code) <= MAX_UNICODE_CHAR
+ && ! char_surrogate_p (XINT (code))))
{
- /* Various ranges of CJK characters; see UnicodeData.txt. */
- if ((0x3400 <= code && code <= 0x4DB5)
- || (0x4E00 <= code && code <= 0x9FD5)
- || (0x20000 <= code && code <= 0x2A6D6)
- || (0x2A700 <= code && code <= 0x2B734)
- || (0x2B740 <= code && code <= 0x2B81D)
- || (0x2B820 <= code && code <= 0x2CEA1))
- return code;
- else
- invalid_character_name (name);
+ AUTO_STRING (format, "\\N{%s}");
+ AUTO_STRING_WITH_LEN (namestr, name, name_len);
+ xsignal1 (Qinvalid_read_syntax, CALLN (Fformat, format, namestr));
}
- /* Look up the name in the table returned by `ucs-names'. */
- Lisp_Object names = call0 (Qucs_names);
- return check_scalar_value (CDR (Fassoc (name, names)), name);
+ return XINT (code);
}
/* Bound on the length of a Unicode character name. As of
- Unicode 9.0.0 the maximum is 83, so this should be safe. */
+ Unicode 9.0.0 the maximum is 83, so this should be safe. */
enum { UNICODE_CHARACTER_NAME_LENGTH_BOUND = 200 };
/* Read a \-escape sequence, assuming we already read the `\'.
@@ -2458,14 +2411,14 @@ read_escape (Lisp_Object readcharfun, bool stringp)
end_of_file_error ();
if (c == '}')
break;
- if (! c_isascii (c))
+ if (! (0 < c && c < 0x80))
{
AUTO_STRING (format,
- "Non-ASCII character U+%04X in character name");
+ "Invalid character U+%04X in character name");
xsignal1 (Qinvalid_read_syntax,
CALLN (Fformat, format, make_natnum (c)));
}
- /* We treat multiple adjacent whitespace characters as a
+ /* Treat multiple adjacent whitespace characters as a
single space character. This makes it easier to use
character names in e.g. multi-line strings. */
if (c_isspace (c))
@@ -2483,7 +2436,8 @@ read_escape (Lisp_Object readcharfun, bool stringp)
}
if (length == 0)
invalid_syntax ("Empty character name");
- return character_name_to_code (make_unibyte_string (name, length));
+ name[length] = '\0';
+ return character_name_to_code (name, length);
}
default:
diff --git a/test/src/lread-tests.el b/test/src/lread-tests.el
index ff5d0f6..2ebaf49 100644
--- a/test/src/lread-tests.el
+++ b/test/src/lread-tests.el
@@ -1,6 +1,6 @@
;;; lread-tests.el --- tests for lread.c -*- lexical-binding: t; -*-
-;; Copyright (C) 2016 Google Inc.
+;; Copyright (C) 2016 Free Software Foundation, Inc.
;; Author: Philipp Stephani <phst@google.com>
@@ -26,11 +26,10 @@
;;; Code:
(ert-deftest lread-char-number ()
- (should (equal ?\N{U+A817} #xA817)))
+ (should (equal (read "?\\N{U+A817}") #xA817)))
(ert-deftest lread-char-name ()
- (should (equal ?\N{SYLOTI NAGRI LETTER
- DHO}
+ (should (equal (read "?\\N{SYLOTI NAGRI LETTER \n DHO}")
#xA817)))
(ert-deftest lread-char-invalid-number ()
@@ -46,16 +45,23 @@
(ert-deftest lread-char-empty-name ()
(should-error (read "?\\N{}") :type 'invalid-read-syntax))
-(ert-deftest lread-char-cjk-name ()
- (should (equal ?\N{CJK IDEOGRAPH-2B734} #x2B734)))
-
-(ert-deftest lread-char-invalid-cjk-name ()
- (should-error (read "?\\N{CJK IDEOGRAPH-2B735}") :type 'invalid-read-syntax))
-
-(ert-deftest lread-string-char-number ()
- (should (equal "a\N{U+A817}b" "a\uA817b")))
+(ert-deftest lread-char-surrogate-1 ()
+ (should-error (read "?\\N{U+D800}") :type 'invalid-read-syntax))
+(ert-deftest lread-char-surrogate-2 ()
+ (should-error (read "?\\N{U+D801}") :type 'invalid-read-syntax))
+(ert-deftest lread-char-surrogate-3 ()
+ (should-error (read "?\\N{U+Dffe}") :type 'invalid-read-syntax))
+(ert-deftest lread-char-surrogate-4 ()
+ (should-error (read "?\\N{U+DFFF}") :type 'invalid-read-syntax))
+
+(ert-deftest lread-string-char-number-1 ()
+ (should (equal (read "\"a\\N{U+A817}b\"") "a\uA817b")))
+(ert-deftest lread-string-char-number-2 ()
+ (should-error (read "?\\N{0.5}") :type 'invalid-read-syntax))
+(ert-deftest lread-string-char-number-3 ()
+ (should-error (read "?\\N{U+-0}") :type 'invalid-read-syntax))
(ert-deftest lread-string-char-name ()
- (should (equal "a\N{SYLOTI NAGRI LETTER DHO}b" "a\uA817b")))
+ (should (equal (read "\"a\\N{SYLOTI NAGRI LETTER DHO}b\"") "a\uA817b")))
;;; lread-tests.el ends here
--
2.5.5