[PATCH] Allow inserting non-BMP characters

unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed

* [PATCH] Allow inserting non-BMP characters
       [not found] <CAArVCkRx8p_vaFKJ_kXRuoZCKVBSYr=94RJANGpU0NXvkEZv6A@mail.gmail.com>
@ 2017-12-25 21:01 ` Philipp Stephani
  2017-12-26  1:26   ` Alan Third
  2017-12-26  4:46   ` Eli Zaretskii
  0 siblings, 2 replies; 17+ messages in thread
From: Philipp Stephani @ 2017-12-25 21:01 UTC (permalink / raw)
  To: emacs-devel; +Cc: Philipp Stephani

* src/character.h (char_low_surrogate_p, char_high_surrogate_p)
(surrogates_to_codepoint): New functions.

* src/nsterm.m (insertText:): Properly handle surrogate pairs.
---
 src/character.h | 26 ++++++++++++++++++++++++++
 src/nsterm.m    | 27 ++++++++++++++++++++-------
 2 files changed, 46 insertions(+), 7 deletions(-)

diff --git a/src/character.h b/src/character.h
index c716885d46..8fa79c6bd9 100644
--- a/src/character.h
+++ b/src/character.h
@@ -614,6 +614,32 @@ char_surrogate_p (int c)
   return 0xD800 <= c && c <= 0xDFFF;
 }
 
+/* Return true if C is a low surrogate.  */
+
+INLINE bool
+char_low_surrogate_p (int c)
+{
+  return 0xDC00 <= c && c <= 0xDFFF;
+}
+
+/* Return true if C is a high surrogate.  */
+
+INLINE bool
+char_high_surrogate_p (int c)
+{
+  return 0xD800 <= c && c <= 0xDBFF;
+}
+
+/* Return the Unicode code point for the given UTF-16 surrogates.  */
+
+INLINE int
+surrogates_to_codepoint (int low, int high)
+{
+  eassert (char_low_surrogate_p (low));
+  eassert (char_high_surrogate_p (high));
+  return 0x10000 + (low - 0xDC00) + ((high - 0xD800) * 0x400);
+}
+
 /* Data type for Unicode general category.
 
    The order of members must be in sync with the 8th element of the
diff --git a/src/nsterm.m b/src/nsterm.m
index 07ac8f978f..da8d4dce6b 100644
--- a/src/nsterm.m
+++ b/src/nsterm.m
@@ -6283,14 +6283,13 @@ flag set (this is probably a bug in the OS).
          by doCommandBySelector: deleteBackward: */
 - (void)insertText: (id)aString
 {
-  int code;
-  int len = [(NSString *)aString length];
-  int i;
+  NSString *s = aString;
+  NSUInteger len = [s length];
 
   NSTRACE ("[EmacsView insertText:]");
 
   if (NS_KEYLOG)
-    NSLog (@"insertText '%@'\tlen = %d", aString, len);
+    NSLog (@"insertText '%@'\tlen = %lu", aString, (unsigned long) len);
   processingCompose = NO;
 
   if (!emacs_event)
@@ -6301,9 +6300,22 @@ - (void)insertText: (id)aString
     [self deleteWorkingText];
 
   /* now insert the string as keystrokes */
-  for (i =0; i<len; i++)
-    {
-      code = [aString characterAtIndex: i];
+  USE_SAFE_ALLOCA;
+  unichar *utf16_buffer;
+  SAFE_NALLOCA (utf16_buffer, 1, len);
+  [s getCharacters:utf16_buffer range:NSMakeRange (0, len)];
+  for (NSUInteger i = 0; i < len; i++)
+    {
+      NSUInteger code = utf16_buffer[i];
+      if (char_high_surrogate_p (code) && i < len - 1)
+        {
+          unichar low = utf16_buffer[i + 1];
+          if (char_low_surrogate_p (low))
+            {
+              code = surrogates_to_codepoint (low, code);
+              ++i;
+            }
+        }
       /* TODO: still need this? */
       if (code == 0x2DC)
         code = '~'; /* 0x7E */
@@ -6314,6 +6326,7 @@ - (void)insertText: (id)aString
       emacs_event->code = code;
       EV_TRAILER ((id)nil);
     }
+  SAFE_FREE ();
 }
 
 
-- 
2.15.1




^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] Allow inserting non-BMP characters
  2017-12-25 21:01 ` [PATCH] Allow inserting non-BMP characters Philipp Stephani
@ 2017-12-26  1:26   ` Alan Third
  2017-12-26  4:46   ` Eli Zaretskii
  1 sibling, 0 replies; 17+ messages in thread
From: Alan Third @ 2017-12-26  1:26 UTC (permalink / raw)
  To: Philipp Stephani; +Cc: Philipp Stephani, emacs-devel

Well, that’s annoying. I just wrote nearly the same patch.

Yours is better.

On Mon, Dec 25, 2017 at 10:01:15PM +0100, Philipp Stephani wrote:
> * src/character.h (char_low_surrogate_p, char_high_surrogate_p)
> (surrogates_to_codepoint): New functions.
> 
> * src/nsterm.m (insertText:): Properly handle surrogate pairs.
<snip>
-- 
Alan Third



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Allow inserting non-BMP characters
  2017-12-25 21:01 ` [PATCH] Allow inserting non-BMP characters Philipp Stephani
  2017-12-26  1:26   ` Alan Third
@ 2017-12-26  4:46   ` Eli Zaretskii
  2017-12-26 10:35     ` Philipp Stephani
  1 sibling, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2017-12-26  4:46 UTC (permalink / raw)
  To: Philipp Stephani; +Cc: phst, emacs-devel

> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Mon, 25 Dec 2017 22:01:15 +0100
> Cc: Philipp Stephani <phst@google.com>
> 
> +/* Return the Unicode code point for the given UTF-16 surrogates.  */
> +
> +INLINE int
> +surrogates_to_codepoint (int low, int high)
> +{
> +  eassert (char_low_surrogate_p (low));
> +  eassert (char_high_surrogate_p (high));
> +  return 0x10000 + (low - 0xDC00) + ((high - 0xD800) * 0x400);
> +}
> +
>  /* Data type for Unicode general category.

Suggest to move surrogates_to_codepoint to coding.c, and then use the
macros UTF_16_HIGH_SURROGATE_P and UTF_16_LOW_SURROGATE_P defined
there.  Also, a single-liner sounds like too little to justify a
function, so maybe make all of that macros in coding.h, and include
the latter in nsterm.m.

> +  USE_SAFE_ALLOCA;
> +  unichar *utf16_buffer;
> +  SAFE_NALLOCA (utf16_buffer, 1, len);

Maximum length of a UTF-16 sequence is known in advance, so why do you
need SAFE_NALLOCA here?  Couldn't you use a buffer of fixed length
instead?

Thanks.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Allow inserting non-BMP characters
  2017-12-26  4:46   ` Eli Zaretskii
@ 2017-12-26 10:35     ` Philipp Stephani
  2017-12-26 16:11       ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: Philipp Stephani @ 2017-12-26 10:35 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: phst, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1668 bytes --]

Eli Zaretskii <eliz@gnu.org> schrieb am Di., 26. Dez. 2017 um 05:46 Uhr:

> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Mon, 25 Dec 2017 22:01:15 +0100
> > Cc: Philipp Stephani <phst@google.com>
> >
> > +/* Return the Unicode code point for the given UTF-16 surrogates.  */
> > +
> > +INLINE int
> > +surrogates_to_codepoint (int low, int high)
> > +{
> > +  eassert (char_low_surrogate_p (low));
> > +  eassert (char_high_surrogate_p (high));
> > +  return 0x10000 + (low - 0xDC00) + ((high - 0xD800) * 0x400);
> > +}
> > +
> >  /* Data type for Unicode general category.
>
> Suggest to move surrogates_to_codepoint to coding.c, and then use the
> macros UTF_16_HIGH_SURROGATE_P and UTF_16_LOW_SURROGATE_P defined
> there.


Hmm, I'd rather go the other way round and remove these macros later. They
are macros, thus worse than functions, and don't seem to be correct either
(what about a value such as 0x11DC00?).


>   Also, a single-liner sounds like too little to justify a
> function, so maybe make all of that macros in coding.h, and include
> the latter in nsterm.m.
>

No new macros please if we can avoid it. Functions are strictly better.
I don't care much whether they are in character.h or coding.h, but
char_surrogate_p is already in character.h.


>
> > +  USE_SAFE_ALLOCA;
> > +  unichar *utf16_buffer;
> > +  SAFE_NALLOCA (utf16_buffer, 1, len);
>
> Maximum length of a UTF-16 sequence is known in advance, so why do you
> need SAFE_NALLOCA here?  Couldn't you use a buffer of fixed length
> instead?
>
>
The text being inserted can be arbitrarily long. Even single characters
(i.e. extended grapheme clusters) can be arbitrarily long.

[-- Attachment #2: Type: text/html, Size: 2532 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Allow inserting non-BMP characters
  2017-12-26 10:35     ` Philipp Stephani
@ 2017-12-26 16:11       ` Eli Zaretskii
  2017-12-26 18:50         ` Philipp Stephani
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2017-12-26 16:11 UTC (permalink / raw)
  To: Philipp Stephani; +Cc: phst, emacs-devel

> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Tue, 26 Dec 2017 10:35:42 +0000
> Cc: emacs-devel@gnu.org, phst@google.com
> 
>  Suggest to move surrogates_to_codepoint to coding.c, and then use the
>  macros UTF_16_HIGH_SURROGATE_P and UTF_16_LOW_SURROGATE_P defined
>  there.
> 
> Hmm, I'd rather go the other way round and remove these macros later. They are macros, thus worse than
> functions,

I don't think we have a policy to prefer inline functions to macros,
and I don't think we should have such a policy.  We use inline
functions when that's necessary, but we don't in general prefer them.
They have their own problems, see the comments in lisp.h for some of
that.

> and don't seem to be correct either (what about a value such as 0x11DC00?).

??? They care correct for UTF-16 sequences, which are 16-bit numbers.
If you need to augment them by testing the high-order bits to be zero
in your case, that's okay, but I don't see any need for introducing
similar but different functionality.

> No new macros please if we can avoid it. Functions are strictly better.

Sorry, I disagree.  Each has its advantages, and on balance I find
macros to be slightly better, certainly not worse.  There's no need to
avoid them in C.

> I don't care much whether they are in character.h or coding.h, but char_surrogate_p is already in character.h.

char_surrogate_p should have used the coding.h macros as well.

>  > +  USE_SAFE_ALLOCA;
>  > +  unichar *utf16_buffer;
>  > +  SAFE_NALLOCA (utf16_buffer, 1, len);
> 
>  Maximum length of a UTF-16 sequence is known in advance, so why do you
>  need SAFE_NALLOCA here?  Couldn't you use a buffer of fixed length
>  instead?
> 
> The text being inserted can be arbitrarily long. Even single characters (i.e. extended grapheme clusters) can
> be arbitrarily long.

Yes, but why do you first copy the input into a separate buffer?  Why
not convert each UTF-16 sequence separately, as you go through the
loop?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Allow inserting non-BMP characters
  2017-12-26 16:11       ` Eli Zaretskii
@ 2017-12-26 18:50         ` Philipp Stephani
  2017-12-26 20:22           ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: Philipp Stephani @ 2017-12-26 18:50 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: phst, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 3856 bytes --]

Eli Zaretskii <eliz@gnu.org> schrieb am Di., 26. Dez. 2017 um 17:11 Uhr:

> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Tue, 26 Dec 2017 10:35:42 +0000
> > Cc: emacs-devel@gnu.org, phst@google.com
> >
> >  Suggest to move surrogates_to_codepoint to coding.c, and then use the
> >  macros UTF_16_HIGH_SURROGATE_P and UTF_16_LOW_SURROGATE_P defined
> >  there.
> >
> > Hmm, I'd rather go the other way round and remove these macros later.
> They are macros, thus worse than
> > functions,
>
> I don't think we have a policy to prefer inline functions to macros,
> and I don't think we should have such a policy.  We use inline
> functions when that's necessary, but we don't in general prefer them.
> They have their own problems, see the comments in lisp.h for some of
> that.
>

Thanks, the only discussion I saw there was about some performance issues:

   Some operations are so commonly executed that they are implemented
   as macros, not functions, because otherwise runtime performance would
   suffer too much when compiling with GCC without optimization.

   FIXME: Remove the lisp_h_OP macros, and define just the inline OP
   functions, once "gcc -Og" (new to GCC 4.8) works well enough for
   Emacs developers.  Maybe in the year 2020.  See Bug#11935.

That bug says that GCC 4.8 should be available in Debian stable to support
-Og. Debian stable now already has 6.3 (
https://wiki.debian.org/DebianStretch#Packages_.26_versions). So maybe it's
time to compile development versions with -Og and try to reapply the
original patch. 5 years have passed, and compilers have gotten a lot better.

In any case, the new functions are certainly not commonly executed. They
are currently only executed when converting non-BMP keystrokes on macOS,
which is rare enough.

>
> > and don't seem to be correct either (what about a value such as
> 0x11DC00?).
>
> ??? They care correct for UTF-16 sequences, which are 16-bit numbers.
> If you need to augment them by testing the high-order bits to be zero
> in your case, that's okay, but I don't see any need for introducing
> similar but different functionality.
>

I'd be OK with using the macros since they already exist, but I wouldn't
want to touch them without converting them to functions first, and for
using them in nsterm.m I'd have to move them around.

>
> > No new macros please if we can avoid it. Functions are strictly better.
>
> Sorry, I disagree.  Each has its advantages, and on balance I find
> macros to be slightly better, certainly not worse.  There's no need to
> avoid them in C.
>

I disagree, see e.g. https://gcc.gnu.org/onlinedocs/cpp/Macro-Pitfalls.html and
many other sources.
Sometimes macros are unavoidable, but not here.

>
> > I don't care much whether they are in character.h or coding.h, but
> char_surrogate_p is already in character.h.
>
> char_surrogate_p should have used the coding.h macros as well.
>
> >  > +  USE_SAFE_ALLOCA;
> >  > +  unichar *utf16_buffer;
> >  > +  SAFE_NALLOCA (utf16_buffer, 1, len);
> >
> >  Maximum length of a UTF-16 sequence is known in advance, so why do you
> >  need SAFE_NALLOCA here?  Couldn't you use a buffer of fixed length
> >  instead?
> >
> > The text being inserted can be arbitrarily long. Even single characters
> (i.e. extended grapheme clusters) can
> > be arbitrarily long.
>
> Yes, but why do you first copy the input into a separate buffer?  Why
> not convert each UTF-16 sequence separately, as you go through the
> loop?
>

Message (method) invocations in Objective-C have high overhead because they
are late-bound. Therefore it is advisable to minimize the number of
messages sent.
https://developer.apple.com/documentation/foundation/nsstring/1408720-getcharacters?language=objc
also
indicates that a (properly implemented) getCharacters call is faster than
calling characterAtIndex in a loop.

[-- Attachment #2: Type: text/html, Size: 5500 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Allow inserting non-BMP characters
  2017-12-26 18:50         ` Philipp Stephani
@ 2017-12-26 20:22           ` Eli Zaretskii
  2017-12-26 21:36             ` Alan Third
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2017-12-26 20:22 UTC (permalink / raw)
  To: Philipp Stephani; +Cc: phst, emacs-devel

> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Tue, 26 Dec 2017 18:50:46 +0000
> Cc: emacs-devel@gnu.org, phst@google.com
> 
>  I don't think we have a policy to prefer inline functions to macros,
>  and I don't think we should have such a policy.  We use inline
>  functions when that's necessary, but we don't in general prefer them.
>  They have their own problems, see the comments in lisp.h for some of
>  that.
> 
> Thanks, the only discussion I saw there was about some performance issues:

Let me make it more clear:

Macros are faster in non-optimized builds (that's why we have such a
complex setup with them in lisp.h).

Macros are also better for debugging, especially when you debug a core
file.  Invoking a function when debugging needs a running process, so
it cannot be done when debugging a core file.  And sometimes the
compiler doesn't keep a non-inlined version of an inline function, so
it cannot be called from the debugger at all.

On the downside, macros can be less readable when they are complex,
and have additional problems when you need local variables.  (None of
that is relevant to the issue at hand.)

>  > and don't seem to be correct either (what about a value such as 0x11DC00?).
> 
>  ??? They care correct for UTF-16 sequences, which are 16-bit numbers.
>  If you need to augment them by testing the high-order bits to be zero
>  in your case, that's okay, but I don't see any need for introducing
>  similar but different functionality.
> 
> I'd be OK with using the macros since they already exist, but I wouldn't want to touch them without converting
> them to functions first, and for using them in nsterm.m I'd have to move them around.

You don't need to convert the macros to anything, just add a test that
you need, as in

   if (c < 0xFFFF && UTF_16_HIGH_SURROGATE_P (c))
     ...

>  > No new macros please if we can avoid it. Functions are strictly better.
> 
>  Sorry, I disagree.  Each has its advantages, and on balance I find
>  macros to be slightly better, certainly not worse.  There's no need to
>  avoid them in C.
> 
> I disagree, see e.g. https://gcc.gnu.org/onlinedocs/cpp/Macro-Pitfalls.html and many other sources.
> Sometimes macros are unavoidable, but not here.

See above.

>  Yes, but why do you first copy the input into a separate buffer?  Why
>  not convert each UTF-16 sequence separately, as you go through the
>  loop?
> 
> Message (method) invocations in Objective-C have high overhead because they are late-bound. Therefore it is
> advisable to minimize the number of messages sent.
> https://developer.apple.com/documentation/foundation/nsstring/1408720-getcharacters?language=objc also
> indicates that a (properly implemented) getCharacters call is faster than calling characterAtIndex in a loop.

Is that a fact, or should we measure that?



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Allow inserting non-BMP characters
  2017-12-26 20:22           ` Eli Zaretskii
@ 2017-12-26 21:36             ` Alan Third
  2017-12-27  3:41               ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: Alan Third @ 2017-12-26 21:36 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: phst, Philipp Stephani, emacs-devel

On Tue, Dec 26, 2017 at 10:22:36PM +0200, Eli Zaretskii wrote:
> > Message (method) invocations in Objective-C have high overhead because they are late-bound. Therefore it is
> > advisable to minimize the number of messages sent.
> > https://developer.apple.com/documentation/foundation/nsstring/1408720-getcharacters?language=objc also
> > indicates that a (properly implemented) getCharacters call is faster than calling characterAtIndex in a loop.
> 
> Is that a fact, or should we measure that?

Apple’s documentation says to put it in a buffer:

    If you want to iterate over the characters of a string, one of the
    things you should not do is use the characterAtIndex: method to
    retrieve each character separately. This method is not designed for
    repeated access. Instead, consider fetching the characters all at once
    using the getCharacters:range: method and iterating over the bytes
    directly.

    https://developer.apple.com/library/content/documentation/Cocoa/Conceptual/CocoaPerformance/Articles/StringDrawing.html#//apple_ref/doc/uid/TP40001445-112378

-- 
Alan Third



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Allow inserting non-BMP characters
  2017-12-26 21:36             ` Alan Third
@ 2017-12-27  3:41               ` Eli Zaretskii
  2017-12-28 11:38                 ` Alan Third
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2017-12-27  3:41 UTC (permalink / raw)
  To: Alan Third; +Cc: phst, p.stephani2, emacs-devel

> Date: Tue, 26 Dec 2017 21:36:42 +0000
> From: Alan Third <alan@idiocy.org>
> Cc: Philipp Stephani <p.stephani2@gmail.com>, phst@google.com,
> 	emacs-devel@gnu.org
> 
> Apple’s documentation says to put it in a buffer:
> 
>     If you want to iterate over the characters of a string, one of the
>     things you should not do is use the characterAtIndex: method to
>     retrieve each character separately. This method is not designed for
>     repeated access. Instead, consider fetching the characters all at once
>     using the getCharacters:range: method and iterating over the bytes
>     directly.
> 
>     https://developer.apple.com/library/content/documentation/Cocoa/Conceptual/CocoaPerformance/Articles/StringDrawing.html#//apple_ref/doc/uid/TP40001445-112378

What about the possibility that SAFE_NALLOCA could signal an error and
longjmp to top level?  Does this code always run in the main thread,
and if so, can it allow such longjmp's?



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Allow inserting non-BMP characters
  2017-12-27  3:41               ` Eli Zaretskii
@ 2017-12-28 11:38                 ` Alan Third
  2017-12-28 12:31                   ` Philipp Stephani
  0 siblings, 1 reply; 17+ messages in thread
From: Alan Third @ 2017-12-28 11:38 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: phst, p.stephani2, emacs-devel

On Wed, Dec 27, 2017 at 05:41:23AM +0200, Eli Zaretskii wrote:
> What about the possibility that SAFE_NALLOCA could signal an error and
> longjmp to top level?  Does this code always run in the main thread,
> and if so, can it allow such longjmp's?

I think it does always run in the main thread, however since it runs
within the NSApplication run loop I’ve no idea what would happen if we
did a longjmp.

-- 
Alan Third



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Allow inserting non-BMP characters
  2017-12-28 11:38                 ` Alan Third
@ 2017-12-28 12:31                   ` Philipp Stephani
  2017-12-28 16:29                     ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: Philipp Stephani @ 2017-12-28 12:31 UTC (permalink / raw)
  To: Alan Third; +Cc: phst, Eli Zaretskii, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 645 bytes --]

Alan Third <alan@idiocy.org> schrieb am Do., 28. Dez. 2017 um 12:38 Uhr:

> On Wed, Dec 27, 2017 at 05:41:23AM +0200, Eli Zaretskii wrote:
> > What about the possibility that SAFE_NALLOCA could signal an error and
> > longjmp to top level?  Does this code always run in the main thread,
> > and if so, can it allow such longjmp's?
>
> I think it does always run in the main thread, however since it runs
> within the NSApplication run loop I’ve no idea what would happen if we
> did a longjmp.
>
>
We should probably avoid longjmps here. This message is invoked by the
window manager, which most likely can't deal with longjmps.

[-- Attachment #2: Type: text/html, Size: 945 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Allow inserting non-BMP characters
  2017-12-28 12:31                   ` Philipp Stephani
@ 2017-12-28 16:29                     ` Eli Zaretskii
  2017-12-29 20:14                       ` Philipp Stephani
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2017-12-28 16:29 UTC (permalink / raw)
  To: Philipp Stephani; +Cc: phst, alan, emacs-devel

> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Thu, 28 Dec 2017 12:31:39 +0000
> Cc: phst@google.com, Eli Zaretskii <eliz@gnu.org>, emacs-devel@gnu.org
> 
> Alan Third <alan@idiocy.org> schrieb am Do., 28. Dez. 2017 um 12:38 Uhr:
> 
>  On Wed, Dec 27, 2017 at 05:41:23AM +0200, Eli Zaretskii wrote:
>  > What about the possibility that SAFE_NALLOCA could signal an error and
>  > longjmp to top level?  Does this code always run in the main thread,
>  > and if so, can it allow such longjmp's?
> 
>  I think it does always run in the main thread, however since it runs
>  within the NSApplication run loop I’ve no idea what would happen if we
>  did a longjmp.
> 
> We should probably avoid longjmps here. This message is invoked by the window manager, which most likely
> can't deal with longjmps. 

Then maybe process the input in chunks using a fixed-size buffer?



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Allow inserting non-BMP characters
  2017-12-28 16:29                     ` Eli Zaretskii
@ 2017-12-29 20:14                       ` Philipp Stephani
  2017-12-29 20:27                         ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: Philipp Stephani @ 2017-12-29 20:14 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: phst, alan, emacs-devel


[-- Attachment #1.1: Type: text/plain, Size: 1229 bytes --]

Eli Zaretskii <eliz@gnu.org> schrieb am Do., 28. Dez. 2017 um 17:29 Uhr:

> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Thu, 28 Dec 2017 12:31:39 +0000
> > Cc: phst@google.com, Eli Zaretskii <eliz@gnu.org>, emacs-devel@gnu.org
> >
> > Alan Third <alan@idiocy.org> schrieb am Do., 28. Dez. 2017 um 12:38 Uhr:
> >
> >  On Wed, Dec 27, 2017 at 05:41:23AM +0200, Eli Zaretskii wrote:
> >  > What about the possibility that SAFE_NALLOCA could signal an error and
> >  > longjmp to top level?  Does this code always run in the main thread,
> >  > and if so, can it allow such longjmp's?
> >
> >  I think it does always run in the main thread, however since it runs
> >  within the NSApplication run loop I’ve no idea what would happen if we
> >  did a longjmp.
> >
> > We should probably avoid longjmps here. This message is invoked by the
> window manager, which most likely
> > can't deal with longjmps.
>
> Then maybe process the input in chunks using a fixed-size buffer?
>

Maybe, but for now I've reverted to just call characterAtIndex repeatedly.
I guess in 99% of cases we insert a single code unit anyway, and if we plan
to make the code more complex we should benchmark it before.

[-- Attachment #1.2: Type: text/html, Size: 1891 bytes --]

[-- Attachment #2: 0001-Allow-inserting-non-BMP-characters.txt --]
[-- Type: text/plain, Size: 4797 bytes --]

From 383e6fe7c50c1e3a96720284fa5fc2c00610959f Mon Sep 17 00:00:00 2001
From: Philipp Stephani <phst@google.com>
Date: Mon, 25 Dec 2017 22:00:00 +0100
Subject: [PATCH] Allow inserting non-BMP characters

* src/coding.h (char_low_surrogate_p, char_high_surrogate_p)
(surrogates_to_codepoint): New functions.

* src/coding.c (decode_coding_utf_16): Use new inline functions;
remove old macros.

* src/nsterm.m (insertText:): Properly handle surrogate pairs.
---
 src/coding.c | 13 +++----------
 src/coding.h | 26 ++++++++++++++++++++++++++
 src/nsterm.m | 25 +++++++++++++++++++------
 3 files changed, 48 insertions(+), 16 deletions(-)

diff --git a/src/coding.c b/src/coding.c
index 1705838ffa..9903d87b92 100644
--- a/src/coding.c
+++ b/src/coding.c
@@ -1515,13 +1515,6 @@ encode_coding_utf_8 (struct coding_system *coding)
 /* See the above "GENERAL NOTES on `detect_coding_XXX ()' functions".
    Return true if a text is encoded in one of UTF-16 based coding systems.  */
 
-#define UTF_16_HIGH_SURROGATE_P(val) \
-  (((val) & 0xFC00) == 0xD800)
-
-#define UTF_16_LOW_SURROGATE_P(val) \
-  (((val) & 0xFC00) == 0xDC00)
-
-
 static bool
 detect_coding_utf_16 (struct coding_system *coding,
 		      struct coding_detection_info *detect_info)
@@ -1686,7 +1679,7 @@ decode_coding_utf_16 (struct coding_system *coding)
 
       if (surrogate)
 	{
-	  if (! UTF_16_LOW_SURROGATE_P (c))
+	  if (! char_low_surrogate_p (c))
 	    {
 	      if (endian == utf_16_big_endian)
 		c1 = surrogate >> 8, c2 = surrogate & 0xFF;
@@ -1694,7 +1687,7 @@ decode_coding_utf_16 (struct coding_system *coding)
 		c1 = surrogate & 0xFF, c2 = surrogate >> 8;
 	      *charbuf++ = c1;
 	      *charbuf++ = c2;
-	      if (UTF_16_HIGH_SURROGATE_P (c))
+	      if (char_high_surrogate_p (c))
 		CODING_UTF_16_SURROGATE (coding) = surrogate = c;
 	      else
 		*charbuf++ = c;
@@ -1708,7 +1701,7 @@ decode_coding_utf_16 (struct coding_system *coding)
 	}
       else
 	{
-	  if (UTF_16_HIGH_SURROGATE_P (c))
+	  if (char_high_surrogate_p (c))
 	    CODING_UTF_16_SURROGATE (coding) = surrogate = c;
 	  else
 	    {
diff --git a/src/coding.h b/src/coding.h
index 66d125b07e..b111ce70c6 100644
--- a/src/coding.h
+++ b/src/coding.h
@@ -662,6 +662,32 @@ struct coding_system
 /* Note that this encodes utf-8, not utf-8-emacs, so it's not a no-op.  */
 #define ENCODE_UTF_8(str) code_convert_string_norecord (str, Qutf_8, true)
 
+/* Return true if C is a low surrogate.  */
+
+INLINE bool
+char_low_surrogate_p (int c)
+{
+  return 0xDC00 <= c && c <= 0xDFFF;
+}
+
+/* Return true if C is a high surrogate.  */
+
+INLINE bool
+char_high_surrogate_p (int c)
+{
+  return 0xD800 <= c && c <= 0xDBFF;
+}
+
+/* Return the Unicode code point for the given UTF-16 surrogates.  */
+
+INLINE int
+surrogates_to_codepoint (int low, int high)
+{
+  eassert (char_low_surrogate_p (low));
+  eassert (char_high_surrogate_p (high));
+  return 0x10000 + (low - 0xDC00) + ((high - 0xD800) * 0x400);
+}
+
 /* Extern declarations.  */
 extern Lisp_Object code_conversion_save (bool, bool);
 extern bool encode_coding_utf_8 (struct coding_system *);
diff --git a/src/nsterm.m b/src/nsterm.m
index 07ac8f978f..0c0b3fd9b5 100644
--- a/src/nsterm.m
+++ b/src/nsterm.m
@@ -6283,14 +6283,13 @@ flag set (this is probably a bug in the OS).
          by doCommandBySelector: deleteBackward: */
 - (void)insertText: (id)aString
 {
-  int code;
-  int len = [(NSString *)aString length];
-  int i;
+  NSString *s = aString;
+  NSUInteger len = [s length];
 
   NSTRACE ("[EmacsView insertText:]");
 
   if (NS_KEYLOG)
-    NSLog (@"insertText '%@'\tlen = %d", aString, len);
+    NSLog (@"insertText '%@'\tlen = %lu", aString, (unsigned long) len);
   processingCompose = NO;
 
   if (!emacs_event)
@@ -6300,10 +6299,24 @@ - (void)insertText: (id)aString
   if (workingText != nil)
     [self deleteWorkingText];
 
+  /* It might be preferable to use getCharacters:range: below,
+     cf. https://developer.apple.com/library/content/documentation/Cocoa/Conceptual/CocoaPerformance/Articles/StringDrawing.html#//apple_ref/doc/uid/TP40001445-112378.
+     However, we probably can't use SAFE_NALLOCA here because it might
+     exit nonlocally.  */
+
   /* now insert the string as keystrokes */
-  for (i =0; i<len; i++)
+  for (NSUInteger i = 0; i < len; i++)
     {
-      code = [aString characterAtIndex: i];
+      NSUInteger code = [s characterAtIndex:i];
+      if (char_high_surrogate_p (code) && i < len - 1)
+        {
+          unichar low = [s characterAtIndex:i + 1];
+          if (char_low_surrogate_p (low))
+            {
+              code = surrogates_to_codepoint (low, code);
+              ++i;
+            }
+        }
       /* TODO: still need this? */
       if (code == 0x2DC)
         code = '~'; /* 0x7E */
-- 
2.15.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] Allow inserting non-BMP characters
  2017-12-29 20:14                       ` Philipp Stephani
@ 2017-12-29 20:27                         ` Eli Zaretskii
  2018-01-07 15:51                           ` Philipp Stephani
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2017-12-29 20:27 UTC (permalink / raw)
  To: Philipp Stephani; +Cc: phst, alan, emacs-devel

> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Fri, 29 Dec 2017 20:14:41 +0000
> Cc: alan@idiocy.org, phst@google.com, emacs-devel@gnu.org
> 
> diff --git a/src/coding.c b/src/coding.c
> index 1705838ffa..9903d87b92 100644
> --- a/src/coding.c
> +++ b/src/coding.c
> @@ -1515,13 +1515,6 @@ encode_coding_utf_8 (struct coding_system *coding)
>  /* See the above "GENERAL NOTES on `detect_coding_XXX ()' functions".
>     Return true if a text is encoded in one of UTF-16 based coding systems.  */
>  
> -#define UTF_16_HIGH_SURROGATE_P(val) \
> -  (((val) & 0xFC00) == 0xD800)
> -
> -#define UTF_16_LOW_SURROGATE_P(val) \
> -  (((val) & 0xFC00) == 0xDC00)
> -
> -
>  static bool
>  detect_coding_utf_16 (struct coding_system *coding,
>  		      struct coding_detection_info *detect_info)
> @@ -1686,7 +1679,7 @@ decode_coding_utf_16 (struct coding_system *coding)
>  
>        if (surrogate)
>  	{
> -	  if (! UTF_16_LOW_SURROGATE_P (c))
> +	  if (! char_low_surrogate_p (c))

Please don't.  This makes decoding UTF-16 a tad slower for no good
reason (more than a tad in unoptimized builds).

(Why can't you just accept what I'm saying?  Don't my position and my
experience mean anything to you?)



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Allow inserting non-BMP characters
  2017-12-29 20:27                         ` Eli Zaretskii
@ 2018-01-07 15:51                           ` Philipp Stephani
  2018-01-07 17:40                             ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: Philipp Stephani @ 2018-01-07 15:51 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: phst, alan, emacs-devel


[-- Attachment #1.1: Type: text/plain, Size: 1301 bytes --]

Eli Zaretskii <eliz@gnu.org> schrieb am Fr., 29. Dez. 2017 um 21:27 Uhr:

> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Fri, 29 Dec 2017 20:14:41 +0000
> > Cc: alan@idiocy.org, phst@google.com, emacs-devel@gnu.org
> >
> > diff --git a/src/coding.c b/src/coding.c
> > index 1705838ffa..9903d87b92 100644
> > --- a/src/coding.c
> > +++ b/src/coding.c
> > @@ -1515,13 +1515,6 @@ encode_coding_utf_8 (struct coding_system *coding)
> >  /* See the above "GENERAL NOTES on `detect_coding_XXX ()' functions".
> >     Return true if a text is encoded in one of UTF-16 based coding
> systems.  */
> >
> > -#define UTF_16_HIGH_SURROGATE_P(val) \
> > -  (((val) & 0xFC00) == 0xD800)
> > -
> > -#define UTF_16_LOW_SURROGATE_P(val) \
> > -  (((val) & 0xFC00) == 0xDC00)
> > -
> > -
> >  static bool
> >  detect_coding_utf_16 (struct coding_system *coding,
> >                     struct coding_detection_info *detect_info)
> > @@ -1686,7 +1679,7 @@ decode_coding_utf_16 (struct coding_system
> *coding)
> >
> >        if (surrogate)
> >       {
> > -       if (! UTF_16_LOW_SURROGATE_P (c))
> > +       if (! char_low_surrogate_p (c))
>
> Please don't.  This makes decoding UTF-16 a tad slower for no good
> reason (more than a tad in unoptimized builds).
>
>
OK, here's a new version of the patch.

[-- Attachment #1.2: Type: text/html, Size: 2093 bytes --]

[-- Attachment #2: 0001-Allow-inserting-non-BMP-characters.txt --]
[-- Type: text/plain, Size: 4030 bytes --]

From aa1d5b2600c5ef0a76c8525462ea88db2a2fa25c Mon Sep 17 00:00:00 2001
From: Philipp Stephani <phst@google.com>
Date: Mon, 25 Dec 2017 22:00:00 +0100
Subject: [PATCH] Allow inserting non-BMP characters

* src/coding.h (UTF_16_HIGH_SURROGATE_P, UTF_16_LOW_SURROGATE_P): Move
from coding.c and document.
(surrogates_to_codepoint): New function.

* src/nsterm.m (insertText:): Properly handle surrogate pairs.
---
 src/coding.c |  7 -------
 src/coding.h | 24 ++++++++++++++++++++++++
 src/nsterm.m | 25 +++++++++++++++++++------
 3 files changed, 43 insertions(+), 13 deletions(-)

diff --git a/src/coding.c b/src/coding.c
index d8bc525026..da62540344 100644
--- a/src/coding.c
+++ b/src/coding.c
@@ -1515,13 +1515,6 @@ encode_coding_utf_8 (struct coding_system *coding)
 /* See the above "GENERAL NOTES on `detect_coding_XXX ()' functions".
    Return true if a text is encoded in one of UTF-16 based coding systems.  */
 
-#define UTF_16_HIGH_SURROGATE_P(val) \
-  (((val) & 0xFC00) == 0xD800)
-
-#define UTF_16_LOW_SURROGATE_P(val) \
-  (((val) & 0xFC00) == 0xDC00)
-
-
 static bool
 detect_coding_utf_16 (struct coding_system *coding,
 		      struct coding_detection_info *detect_info)
diff --git a/src/coding.h b/src/coding.h
index 54100ccd31..d90b799d76 100644
--- a/src/coding.h
+++ b/src/coding.h
@@ -662,6 +662,30 @@ struct coding_system
 /* Note that this encodes utf-8, not utf-8-emacs, so it's not a no-op.  */
 #define ENCODE_UTF_8(str) code_convert_string_norecord (str, Qutf_8, true)
 
+/* Return true if VAL is a high surrogate.  VAL must be a 16-bit code
+   unit.  */
+
+#define UTF_16_HIGH_SURROGATE_P(val) \
+  (((val) & 0xFC00) == 0xD800)
+
+/* Return true if VAL is a low surrogate.  VAL must be a 16-bit code
+   unit.  */
+
+#define UTF_16_LOW_SURROGATE_P(val) \
+  (((val) & 0xFC00) == 0xDC00)
+
+/* Return the Unicode code point for the given UTF-16 surrogates.  */
+
+INLINE int
+surrogates_to_codepoint (int low, int high)
+{
+  eassert (0 <= low && low <= 0xFFFF);
+  eassert (0 <= high && high <= 0xFFFF);
+  eassert (UTF_16_LOW_SURROGATE_P (low));
+  eassert (UTF_16_HIGH_SURROGATE_P (high));
+  return 0x10000 + (low - 0xDC00) + ((high - 0xD800) * 0x400);
+}
+
 /* Extern declarations.  */
 extern Lisp_Object code_conversion_save (bool, bool);
 extern bool encode_coding_utf_8 (struct coding_system *);
diff --git a/src/nsterm.m b/src/nsterm.m
index dd5c7d91ea..3f59e33c7b 100644
--- a/src/nsterm.m
+++ b/src/nsterm.m
@@ -6283,14 +6283,13 @@ flag set (this is probably a bug in the OS).
          by doCommandBySelector: deleteBackward: */
 - (void)insertText: (id)aString
 {
-  int code;
-  int len = [(NSString *)aString length];
-  int i;
+  NSString *s = aString;
+  NSUInteger len = [s length];
 
   NSTRACE ("[EmacsView insertText:]");
 
   if (NS_KEYLOG)
-    NSLog (@"insertText '%@'\tlen = %d", aString, len);
+    NSLog (@"insertText '%@'\tlen = %lu", aString, (unsigned long) len);
   processingCompose = NO;
 
   if (!emacs_event)
@@ -6300,10 +6299,24 @@ - (void)insertText: (id)aString
   if (workingText != nil)
     [self deleteWorkingText];
 
+  /* It might be preferable to use getCharacters:range: below,
+     cf. https://developer.apple.com/library/content/documentation/Cocoa/Conceptual/CocoaPerformance/Articles/StringDrawing.html#//apple_ref/doc/uid/TP40001445-112378.
+     However, we probably can't use SAFE_NALLOCA here because it might
+     exit nonlocally.  */
+
   /* now insert the string as keystrokes */
-  for (i =0; i<len; i++)
+  for (NSUInteger i = 0; i < len; i++)
     {
-      code = [aString characterAtIndex: i];
+      NSUInteger code = [s characterAtIndex:i];
+      if (UTF_16_HIGH_SURROGATE_P (code) && i < len - 1)
+        {
+          unichar low = [s characterAtIndex:i + 1];
+          if (UTF_16_LOW_SURROGATE_P (low))
+            {
+              code = surrogates_to_codepoint (low, code);
+              ++i;
+            }
+        }
       /* TODO: still need this? */
       if (code == 0x2DC)
         code = '~'; /* 0x7E */
-- 
2.15.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] Allow inserting non-BMP characters
  2018-01-07 15:51                           ` Philipp Stephani
@ 2018-01-07 17:40                             ` Eli Zaretskii
  2018-01-07 18:44                               ` Philipp Stephani
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2018-01-07 17:40 UTC (permalink / raw)
  To: Philipp Stephani; +Cc: phst, alan, emacs-devel

> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Sun, 07 Jan 2018 15:51:55 +0000
> Cc: alan@idiocy.org, phst@google.com, emacs-devel@gnu.org
> 
> From aa1d5b2600c5ef0a76c8525462ea88db2a2fa25c Mon Sep 17 00:00:00 2001
> From: Philipp Stephani <phst@google.com>
> Date: Mon, 25 Dec 2017 22:00:00 +0100
> Subject: [PATCH] Allow inserting non-BMP characters
> 
> * src/coding.h (UTF_16_HIGH_SURROGATE_P, UTF_16_LOW_SURROGATE_P): Move
> from coding.c and document.
> (surrogates_to_codepoint): New function.
> 
> * src/nsterm.m (insertText:): Properly handle surrogate pairs.

This is fine with me, thanks.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Allow inserting non-BMP characters
  2018-01-07 17:40                             ` Eli Zaretskii
@ 2018-01-07 18:44                               ` Philipp Stephani
  0 siblings, 0 replies; 17+ messages in thread
From: Philipp Stephani @ 2018-01-07 18:44 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: phst, alan, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 754 bytes --]

Eli Zaretskii <eliz@gnu.org> schrieb am So., 7. Jan. 2018 um 18:41 Uhr:

> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Sun, 07 Jan 2018 15:51:55 +0000
> > Cc: alan@idiocy.org, phst@google.com, emacs-devel@gnu.org
> >
> > From aa1d5b2600c5ef0a76c8525462ea88db2a2fa25c Mon Sep 17 00:00:00 2001
> > From: Philipp Stephani <phst@google.com>
> > Date: Mon, 25 Dec 2017 22:00:00 +0100
> > Subject: [PATCH] Allow inserting non-BMP characters
> >
> > * src/coding.h (UTF_16_HIGH_SURROGATE_P, UTF_16_LOW_SURROGATE_P): Move
> > from coding.c and document.
> > (surrogates_to_codepoint): New function.
> >
> > * src/nsterm.m (insertText:): Properly handle surrogate pairs.
>
> This is fine with me, thanks.
>

Thanks, pushed to master as 703ac3ea1c.

[-- Attachment #2: Type: text/html, Size: 1407 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2018-01-07 18:44 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <CAArVCkRx8p_vaFKJ_kXRuoZCKVBSYr=94RJANGpU0NXvkEZv6A@mail.gmail.com>
2017-12-25 21:01 ` [PATCH] Allow inserting non-BMP characters Philipp Stephani
2017-12-26  1:26   ` Alan Third
2017-12-26  4:46   ` Eli Zaretskii
2017-12-26 10:35     ` Philipp Stephani
2017-12-26 16:11       ` Eli Zaretskii
2017-12-26 18:50         ` Philipp Stephani
2017-12-26 20:22           ` Eli Zaretskii
2017-12-26 21:36             ` Alan Third
2017-12-27  3:41               ` Eli Zaretskii
2017-12-28 11:38                 ` Alan Third
2017-12-28 12:31                   ` Philipp Stephani
2017-12-28 16:29                     ` Eli Zaretskii
2017-12-29 20:14                       ` Philipp Stephani
2017-12-29 20:27                         ` Eli Zaretskii
2018-01-07 15:51                           ` Philipp Stephani
2018-01-07 17:40                             ` Eli Zaretskii
2018-01-07 18:44                               ` Philipp Stephani

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).