Re: Character literals for Unicode (control) characters

unofficial mirror of emacs-devel@gnu.org 

From: Philipp Stephani <p.stephani2@gmail.com>
To: Paul Eggert <eggert@cs.ucla.edu>, Eli Zaretskii <eliz@gnu.org>
Cc: larsi@gnus.org, johnw@gnu.org, emacs-devel@gnu.org
Subject: Re: Character literals for Unicode (control) characters
Date: Sun, 06 Mar 2016 18:28:09 +0000	[thread overview]
Message-ID: <CAArVCkR3r34qp+iM8t5FaRBVvjjE492A_EQyB05dVPe6inA34w@mail.gmail.com> (raw)
In-Reply-To: <56DC7227.10708@cs.ucla.edu>



Paul Eggert <eggert@cs.ucla.edu> wrote on Sun, 6 Mar 2016 at 19:08:

> Thanks for taking this on. Some comments:
>
> Why the hash table? Existing Lisp code dealing with Unicode names uses an
> alist, and it seems to do OK.


Hash tables are as easy to use as alists, but have average O(1) lookup
time, as opposed to O(n) time for alists. Alists are also more prone to
cache misses, because their cons cells are not necessarily contiguous in
memory.
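
To illustrate the lookup-cost argument, here is a minimal standalone C
sketch. It is not code from the patch and does not use Emacs internals:
`struct entry', the three sample names, the FNV-1a hash, and the
8-slot open-addressing table are all assumptions made for the example.
The linear search stands in for an alist; the probed table stands in
for a hash table.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical name/code-point pair; not an Emacs type.  */
struct entry { const char *name; int code; };

static const struct entry entries[] = {
  { "LATIN CAPITAL LETTER A", 0x41 },
  { "ROMAN NUMERAL FOUR", 0x2163 },
  { "SNOWMAN", 0x2603 },
};
enum { N_ENTRIES = sizeof entries / sizeof entries[0], N_SLOTS = 8 };

/* Alist-style lookup: compare against each pair in turn, O(n).  */
static int
lookup_linear (const char *name)
{
  for (int i = 0; i < N_ENTRIES; i++)
    if (strcmp (entries[i].name, name) == 0)
      return entries[i].code;
  return -1;
}

/* FNV-1a string hash, chosen here only because it is short.  */
static size_t
hash_name (const char *s)
{
  size_t h = 2166136261u;
  for (; *s; s++)
    h = (h ^ (unsigned char) *s) * 16777619u;
  return h;
}

/* Open-addressing table; it has more slots than entries, so probing
   always reaches an empty slot.  */
static const struct entry *slots[N_SLOTS];

/* Fill the table once; call this before lookup_hashed.  */
static void
build_table (void)
{
  for (int i = 0; i < N_ENTRIES; i++)
    {
      size_t j = hash_name (entries[i].name) % N_SLOTS;
      while (slots[j])
        j = (j + 1) % N_SLOTS;
      slots[j] = &entries[i];
    }
}

/* Hash-table-style lookup: jump to the hashed slot and probe from
   there, average O(1).  */
static int
lookup_hashed (const char *name)
{
  for (size_t j = hash_name (name) % N_SLOTS; slots[j];
       j = (j + 1) % N_SLOTS)
    if (strcmp (slots[j]->name, name) == 0)
      return slots[j]->code;
  return -1;
}
```

Both functions return the same code point for a known name and -1 for
an unknown one; the difference only matters once the table holds on
the order of 100,000 names rather than three.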


> If a hash table is needed, a hash table should also be
> used by the existing code elsewhere that does something similar. See the
> function ucs-names and its callers.
>

Initially I used ucs-names, but then decided against it because it lacks
most characters. That's OK for a table used for completion, but for
input all characters should be present. So the use cases are different.


>
> If a hash table is needed, I suggest using a perfect hashing function
> (generated by gperf) and checking its results with get-char-code-property.
> That avoids the runtime overhead of initialization.
>

Sounds good, but that would require much more effort and would delay this
project unnecessarily. It can be done later once the basic functionality is
in place.


>
> It needs documentation, both in the Emacs Lisp manual and in NEWS.
>
>
Yes, I've attached a patch.


>
>  > +void init_character_names ()
>  > +{
>
> The usual style is:
>
> void
> init_character_names (void)
> {
>
>
> No need for "const" for local variables (cost exceeds benefit).
>

Removed.


>
>
>  >             if (c_isspace (c))
>  >               {
>  >                 if (! whitespace)
>  >                   {
>  >                     whitespace = true;
>  >                     name[length++] = ' ';
>  >                   }
>  >               }
>  >             else
>  >               {
>  >                 whitespace = false;
>  >                 name[length++] = c;
>  >               }
>
> This would be a bit easier to follow (and most likely a tiny bit more
> efficient) as something like this:
>
> >       bool ws = c_isspace (c);
> >       if (ws)
> >         {
> >           length -= whitespace;
> >           c = ' ';
> >         }
> >       whitespace = ws;
> >       name[length++] = c;
>
>
I'd rather not have length decrease. I moved the assignment out of the
branches, though.
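
For reference, the revised loop shape can be tried in isolation with
the standalone sketch below. It is an illustration under stated
assumptions, not the code in lread.c: `collapse_whitespace' and
`ascii_isspace' are hypothetical names (the latter stands in for
Gnulib's c_isspace), and the bounds check returns -1 instead of
signalling a syntax error.

```c
#include <stdbool.h>
#include <stddef.h>

/* ASCII-only whitespace test: space, \t, \n, \v, \f, \r.
   Hypothetical stand-in for c_isspace.  */
static bool
ascii_isspace (int c)
{
  return c == ' ' || (c >= '\t' && c <= '\r');
}

/* Copy SRC into DST, which has room for DST_SIZE bytes, replacing
   each run of whitespace by a single space.  Return the length of
   the result, or -1 if it does not fit.  */
static ptrdiff_t
collapse_whitespace (const char *src, char *dst, size_t dst_size)
{
  size_t length = 0;
  bool whitespace = false;
  if (dst_size == 0)
    return -1;
  for (; *src; src++)
    {
      char c = *src;
      if (ascii_isspace ((unsigned char) c))
        {
          /* Skip everything after the first character of a run.  */
          if (whitespace)
            continue;
          c = ' ';
          whitespace = true;
        }
      else
        whitespace = false;
      /* Keep one byte free for the terminating null.  */
      if (length + 1 >= dst_size)
        return -1;
      dst[length++] = c;
    }
  dst[length] = '\0';
  return length;
}
```

As in the patch, length only ever grows, and there is a single
assignment into the buffer after the branches.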


[-- Attachment #2: 0002-Add-documentation-for-character-name-escapes.patch --]
[-- Type: application/octet-stream, Size: 2348 bytes --]

From d0d5219a358a2d8e853f1ce11cf16fb2629697c6 Mon Sep 17 00:00:00 2001
From: Philipp Stephani <phst@google.com>
Date: Sun, 6 Mar 2016 19:07:06 +0100
Subject: [PATCH 2/2] Add documentation for character name escapes

---
 doc/lispref/nonascii.texi |  2 +-
 doc/lispref/objects.texi  | 10 ++++++++++
 etc/NEWS                  |  5 +++++
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index 9cf3b57..66ad9ac 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
@@ -633,7 +633,7 @@ Character Properties
 @end group
 @group
 ;; U+2163 ROMAN NUMERAL FOUR
-(get-char-code-property ?\u2163 'numeric-value)
+(get-char-code-property ?\N@{ROMAN NUMERAL FOUR@} 'numeric-value)
      @result{} 4
 @end group
 @group
diff --git a/doc/lispref/objects.texi b/doc/lispref/objects.texi
index 3245930..96b334d 100644
--- a/doc/lispref/objects.texi
+++ b/doc/lispref/objects.texi
@@ -387,6 +387,16 @@ General Escape Syntax
 for the character @kbd{C-b}.  Only characters up to octal code 777 can
 be specified this way.
 
+  Fourthly, you can specify characters by their name.  A character
+name escape sequence consists of a backslash, @samp{N@{}, the Unicode
+character name, and @samp{@}}.  Alternatively, you can also put the
+numeric code point value between the braces, using the syntax
+@samp{\N@{U+nnnn@}}, where @samp{nnnn} denotes between one and eight
+hexadecimal digits.  Thus, @samp{?\N@{LATIN CAPITAL LETTER A@}} and
+@samp{?\N@{U+41@}} both denote the character @kbd{A}.  To simplify
+entering multi-line strings, you can replace spaces in the character
+names by arbitrary non-empty sequences of whitespace (e.g., newlines).
+
   These escape sequences may also be used in strings.  @xref{Non-ASCII
 in Strings}.
 
diff --git a/etc/NEWS b/etc/NEWS
index 92d69d2..9c77474 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -159,6 +159,11 @@ that negotiation should complete even on non-blocking sockets.
 `window-pixel-height-before-size-change' allow to detect which window
 changed size when `window-size-change-functions' are run.
 
++++
+** Emacs now supports character name escape sequences in character and
+string literals.  The syntax variants \N{character name} and
+\N{U+code} are supported.
+
 \f
 * Changes in Emacs 25.2 on Non-Free Operating Systems
 
-- 
2.7.0


[-- Attachment #3: 0003-Minor-cleanups-for-character-name-escapes.patch --]
[-- Type: application/octet-stream, Size: 2923 bytes --]

From 30e6d9dd4e83a36fe07bbeae678b3f086773346e Mon Sep 17 00:00:00 2001
From: Philipp Stephani <phst@google.com>
Date: Sun, 6 Mar 2016 19:27:21 +0100
Subject: [PATCH 3/3] Minor cleanups for character name escapes.

* lread.c (init_character_names): Add missing `void'.  Remove
top-level `const'.
(read_escape): Simplify loop a bit.  Remove top-level `const'.
---
 src/lread.c | 27 ++++++++++++---------------
 1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/src/lread.c b/src/lread.c
index 6e84fc8..4000637 100644
--- a/src/lread.c
+++ b/src/lread.c
@@ -2159,20 +2159,20 @@ static ptrdiff_t max_character_name_length;
 
 /* Initializes `character_names' and `max_character_name_length'.
    Called by `read_escape'.  */
-void init_character_names ()
+void init_character_names (void)
 {
   character_names = CALLN (Fmake_hash_table,
                            QCtest, Qequal,
                            /* Currently around 100,000 Unicode
                               characters are defined.  */
                            QCsize, make_natnum (100000));
-  const Lisp_Object get_property =
+  Lisp_Object get_property =
     Fsymbol_function (intern_c_string ("get-char-code-property"));
   ptrdiff_t length = 0;
   for (int i = 0; i <= MAX_UNICODE_CHAR; ++i)
     {
-      const Lisp_Object code = make_natnum (i);
-      const Lisp_Object name = call2 (get_property, code, Qname);
+      Lisp_Object code = make_natnum (i);
+      Lisp_Object name = call2 (get_property, code, Qname);
       if (NILP (name)) continue;
       CHECK_STRING (name);
       length = max (length, SBYTES (name));
@@ -2418,25 +2418,22 @@ read_escape (Lisp_Object readcharfun, bool stringp)
                character names in e.g. multi-line strings.  */
             if (c_isspace (c))
               {
-                if (! whitespace)
-                  {
-                    whitespace = true;
-                    name[length++] = ' ';
-                  }
+                if (whitespace)
+                  continue;
+                c = ' ';
+                whitespace = true;
               }
             else
-              {
-                whitespace = false;
-                name[length++] = c;
-              }
+              whitespace = false;
+            name[length++] = c;
             if (length >= max_character_name_length)
               invalid_syntax ("Character name too long");
           }
         if (length == 0)
           invalid_syntax ("Empty character name");
         name[length] = 0;
-        const Lisp_Object lisp_name = make_unibyte_string (name, length);
-        const Lisp_Object code =
+        Lisp_Object lisp_name = make_unibyte_string (name, length);
+        Lisp_Object code =
           (length >= 3 && length <= 10 && name[0] == 'U' && name[1] == '+') ?
           /* Code point as U+N, where N is between 1 and 8 hexadecimal
              digits.  */
-- 
2.7.0
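
The U+N form described in the documentation patch can be sketched on
its own as follows. This is a rough illustration, not the reader code:
`parse_u_plus' is a hypothetical helper, and both the acceptance of
lowercase hexadecimal digits and the rejection of values above
0x10FFFF are assumptions of the sketch rather than something the
patches above spell out. The length test mirrors the condition shown
in read_escape (3 to 10 bytes, starting with "U+").

```c
#include <stddef.h>
#include <string.h>

/* Parse NAME of the form "U+N", where N is between one and eight
   hexadecimal digits.  Return the code point, or -1 if NAME does not
   have that form or the value exceeds 0x10FFFF (assumed limit).  */
static long
parse_u_plus (const char *name)
{
  size_t length = strlen (name);
  if (! (length >= 3 && length <= 10 && name[0] == 'U' && name[1] == '+'))
    return -1;
  unsigned long code = 0;
  for (size_t i = 2; i < length; i++)
    {
      char c = name[i];
      int digit;
      if (c >= '0' && c <= '9')
        digit = c - '0';
      else if (c >= 'A' && c <= 'F')
        digit = c - 'A' + 10;
      else if (c >= 'a' && c <= 'f')
        digit = c - 'a' + 10;
      else
        return -1;
      /* At most eight digits, so this cannot overflow 32 bits.  */
      code = code * 16 + digit;
    }
  return code <= 0x10FFFF ? (long) code : -1;
}
```

With this sketch, "U+41" yields 0x41, the code point of the character
that the documentation example writes as ?\N{U+41}.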

