bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS

unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
@ 2020-08-17 14:11 Mattias Engdegård
  2020-08-17 14:54 ` Andrii Kolomoiets
  2020-08-17 15:55 ` Eli Zaretskii
  0 siblings, 2 replies; 29+ messages in thread
From: Mattias Engdegård @ 2020-08-17 14:11 UTC (permalink / raw)
  To: 42904; +Cc: Alan Third

[-- Attachment #1: Type: text/plain, Size: 1073 bytes --]

Setting a frame title that contains non-Unicode characters causes a crash in the NS backend. (Other platforms may or may not deal with it appropriately -- if you have the opportunity to test, please report.)

Since the title is typically derived from the buffer name, this is easily reproduced by

 (rename-buffer "n\351")

The crash occurs in ns_set_name_internal:

  encoded_name = ENCODE_UTF_8 (name);

Here encoded_name is still "n\351" (a 2 byte unibyte string), because the \351 couldn't be encoded.

  str = [NSString stringWithUTF8String: SSDATA (encoded_name)];

Now str is nil since "n\351" isn't valid UTF-8.

    [[view window] setTitle: str];

Here we get an NS crash because nil isn't a valid setTitle: argument.

Proposed patch attached. I didn't find any obvious way to encode an Emacs string into valid UTF-8 (with bad parts replaced) so a new function was written. The corresponding Lisp function was marked internal because it's only there for test purposes, but it could of course be promoted to non-internal if someone wants it.


[-- Attachment #2: 0001-Fix-NS-crash-on-invalid-frame-title-string.patch --]
[-- Type: application/octet-stream, Size: 5805 bytes --]

From 13b43b826a7f7f539484babc275cd9a19a64da9e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Mon, 17 Aug 2020 15:37:33 +0200
Subject: [PATCH] Fix NS crash on invalid frame title string

Instead of blindly assuming that Emacs strings are valid UTF-8, which
they are not, convert them in a more careful way using U+FFFD for
replacing invalid values.

* src/coding.c (string_to_valid_utf_8)
Finternal_string_to_valid_utf_8): New functions.
* src/coding.h: Prototype.
* src/nsfns.m (ns_set_name_internal): Use string_to_valid_utf_8.
* test/src/coding-tests.el (coding-string-to-valid-utf-8): New test.
---
 src/coding.c             | 56 ++++++++++++++++++++++++++++++++++++++++
 src/coding.h             |  2 ++
 src/nsfns.m              |  6 ++---
 test/src/coding-tests.el | 14 ++++++++++
 4 files changed, 74 insertions(+), 4 deletions(-)

diff --git a/src/coding.c b/src/coding.c
index 51bd441de9..65493b07ac 100644
--- a/src/coding.c
+++ b/src/coding.c
@@ -9564,6 +9564,61 @@ code_convert_string_norecord (Lisp_Object string, Lisp_Object coding_system,
 }
 
 
+/* Convert STRING to a pure Unicode string.
+   Non-Unicode values are substituted with U+FFFD REPLACEMENT CHARACTER.
+   Return a unibyte or multibyte string, possibly STRING itself,
+   whose SDATA is guaranteed to be UTF-8.  */
+Lisp_Object
+string_to_valid_utf_8 (Lisp_Object string)
+{
+  if (string_ascii_p (string))
+    return string;
+  if (!STRING_MULTIBYTE (string))
+    string = string_to_multibyte (string);
+
+  /* Now STRING is multibyte.  */
+  unsigned char *buf = NULL;
+  unsigned char *d = NULL;
+  unsigned char *s = SDATA (string);
+  unsigned char *end = s + SBYTES (string);
+  while (s < end)
+    {
+      int len;
+      int c = string_char_and_length (s, &len);
+      if (c > 0x10ffff || char_surrogate_p (c))
+        {
+          /* Not valid for UTF-8.  */
+          if (!d)
+            {
+              buf = xmalloc (4 * SCHARS (string));
+              ptrdiff_t n = s - SDATA (string);
+              memcpy (buf, SDATA (string), n);
+              d = buf + n;
+            }
+          *d++ = 0357;          /* Use U+FFFD.  */
+          *d++ = 0277;
+          *d++ = 0275;
+          s += len;
+        }
+      else if (d)
+        do *d++ = *s++; while (--len);
+      else
+        s += len;
+    }
+  Lisp_Object ret = buf ? make_multibyte_string (buf, SCHARS (string), d - buf)
+                        : string;
+  xfree (buf);
+  return ret;
+}
+
+DEFUN ("internal-string-to-valid-utf-8", Finternal_string_to_valid_utf_8,
+       Sinternal_string_to_valid_utf_8, 1, 1, 0,
+       doc:  /* Internal use only.  */)
+     (Lisp_Object string)
+{
+  return string_to_valid_utf_8 (string);
+}
+
 /* Return the gap address of BUFFER.  If the gap size is less than
    NBYTES, enlarge the gap in advance.  */
 
@@ -11811,6 +11866,7 @@ syms_of_coding (void)
   defsubr (&Scoding_system_aliases);
   defsubr (&Scoding_system_eol_type);
   defsubr (&Scoding_system_priority_list);
+  defsubr (&Sinternal_string_to_valid_utf_8);
 
   DEFVAR_LISP ("coding-system-list", Vcoding_system_list,
 	       doc: /* List of coding systems.
diff --git a/src/coding.h b/src/coding.h
index c2a7b2a00f..98f00a1731 100644
--- a/src/coding.h
+++ b/src/coding.h
@@ -709,6 +709,8 @@ #define UTF_16_LOW_SURROGATE_P(val) \
 extern void encode_coding_object (struct coding_system *,
                                   Lisp_Object, ptrdiff_t, ptrdiff_t,
                                   ptrdiff_t, ptrdiff_t, Lisp_Object);
+extern Lisp_Object string_to_valid_utf_8 (Lisp_Object);
+
 /* Defined in this file.  */
 INLINE int surrogates_to_codepoint (int, int);
 
diff --git a/src/nsfns.m b/src/nsfns.m
index 628233ea0d..3e84568991 100644
--- a/src/nsfns.m
+++ b/src/nsfns.m
@@ -405,9 +405,7 @@ Turn the input menu (an NSMenu) into a lisp list for tracking on lisp side.
   NSString *str;
   NSView *view = FRAME_NS_VIEW (f);
 
-
-  encoded_name = ENCODE_UTF_8 (name);
-
+  encoded_name = string_to_valid_utf_8 (name);
   str = [NSString stringWithUTF8String: SSDATA (encoded_name)];
 
 
@@ -418,7 +416,7 @@ Turn the input menu (an NSMenu) into a lisp list for tracking on lisp side.
   if (!STRINGP (f->icon_name))
     encoded_icon_name = encoded_name;
   else
-    encoded_icon_name = ENCODE_UTF_8 (f->icon_name);
+    encoded_icon_name = string_to_valid_utf_8 (f->icon_name);
 
   str = [NSString stringWithUTF8String: SSDATA (encoded_icon_name)];
 
diff --git a/test/src/coding-tests.el b/test/src/coding-tests.el
index c438ae22ce..f53f63eb48 100644
--- a/test/src/coding-tests.el
+++ b/test/src/coding-tests.el
@@ -429,6 +429,20 @@ coding-check-coding-systems-region
                  '((iso-latin-1 3) (us-ascii 1 3))))
   (should-error (check-coding-systems-region "å" nil '(bad-coding-system))))
 
+(ert-deftest coding-string-to-valid-utf-8 ()
+  (let ((empty "")
+        (valid-uni "Alpha")
+        (valid-multi "m\001ü∫𝔻"))
+    (should (eq (internal-string-to-valid-utf-8 empty) empty))
+    (should (eq (internal-string-to-valid-utf-8 valid-uni) valid-uni))
+    (should (eq (internal-string-to-valid-utf-8 valid-multi) valid-multi)))
+  (should (equal (internal-string-to-valid-utf-8 "unpaired\ud9a3surrogate")
+                 "unpaired\ufffdsurrogate"))
+  (should (equal (internal-string-to-valid-utf-8 "raw\200\377bytes")
+                 "raw\ufffd\ufffdbytes"))
+  (should (equal (internal-string-to-valid-utf-8 "all§\300at\udffeonce")
+                 "all§\ufffdat\ufffdonce")))
+
 ;; Local Variables:
 ;; byte-compile-warnings: (not obsolete)
 ;; End:
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-17 14:11 bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS Mattias Engdegård
@ 2020-08-17 14:54 ` Andrii Kolomoiets
  2020-08-17 15:55   ` Mattias Engdegård
  2020-08-17 15:55 ` Eli Zaretskii
  1 sibling, 1 reply; 29+ messages in thread
From: Andrii Kolomoiets @ 2020-08-17 14:54 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 42904, Alan Third

Mattias Engdegård <mattiase@acm.org> writes:

> Setting a frame title that contains non-Unicode characters causes a crash in the NS backend. (Other platforms may or may not deal with it appropriately -- if you have the opportunity to test, please report.)
>
> Since the title is typically derived from the buffer name, this is easily reproduced by
>
>  (rename-buffer "n\351")

Looks like this is related to bug#41184





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-17 14:54 ` Andrii Kolomoiets
@ 2020-08-17 15:55   ` Mattias Engdegård
  0 siblings, 0 replies; 29+ messages in thread
From: Mattias Engdegård @ 2020-08-17 15:55 UTC (permalink / raw)
  To: Andrii Kolomoiets; +Cc: 42904, Alan Third

merge 42094 41184

17 aug. 2020 kl. 16.54 skrev Andrii Kolomoiets <andreyk.mad@gmail.com>:

> Looks like this is related to bug#41184

Indeed it's the same bug, thank you!






^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-17 14:11 bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS Mattias Engdegård
  2020-08-17 14:54 ` Andrii Kolomoiets
@ 2020-08-17 15:55 ` Eli Zaretskii
  2020-08-17 16:11   ` Mattias Engdegård
  1 sibling, 1 reply; 29+ messages in thread
From: Eli Zaretskii @ 2020-08-17 15:55 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 42904, alan

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Mon, 17 Aug 2020 16:11:52 +0200
> Cc: Alan Third <alan@idiocy.org>
> 
> Proposed patch attached. I didn't find any obvious way to encode an
> Emacs string into valid UTF-8 (with bad parts replaced) so a new
> function was written.

Is something wrong with encode_string_utf_8?  It has arguments that
allow you to replace invalid bytes into the likes of u+FFFD.  Or did I
misunderstand the problem you are facing?





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-17 15:55 ` Eli Zaretskii
@ 2020-08-17 16:11   ` Mattias Engdegård
  2020-08-17 17:05     ` Eli Zaretskii
  0 siblings, 1 reply; 29+ messages in thread
From: Mattias Engdegård @ 2020-08-17 16:11 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 42904, alan

17 aug. 2020 kl. 17.55 skrev Eli Zaretskii <eliz@gnu.org>:

> Is something wrong with encode_string_utf_8?  It has arguments that
> allow you to replace invalid bytes into the likes of u+FFFD.  Or did I
> misunderstand the problem you are facing?

No, that's a valid question. I did try that function first, but it had too many quirks: doesn't accept a unibyte non-ASCII string, sometimes replaces valid characters, doesn't always output UTF-8... It was easier to write a new function which encapsulates the common usage case. In addition, the new function is short and simple enough that it can easily be verified to be correct; encode_string_utf_8 is big and complex.

In addition, it seems likely that the same problem exists elsewhere and it's useful to have a function which solves it right away.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-17 16:11   ` Mattias Engdegård
@ 2020-08-17 17:05     ` Eli Zaretskii
  2020-08-17 18:48       ` Mattias Engdegård
  0 siblings, 1 reply; 29+ messages in thread
From: Eli Zaretskii @ 2020-08-17 17:05 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 42904, alan

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Mon, 17 Aug 2020 18:11:50 +0200
> Cc: 42904@debbugs.gnu.org, alan@idiocy.org
> 
> 17 aug. 2020 kl. 17.55 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > Is something wrong with encode_string_utf_8?  It has arguments that
> > allow you to replace invalid bytes into the likes of u+FFFD.  Or did I
> > misunderstand the problem you are facing?
> 
> No, that's a valid question. I did try that function first, but it had too many quirks: doesn't accept a unibyte non-ASCII string, sometimes replaces valid characters, doesn't always output UTF-8... It was easier to write a new function which encapsulates the common usage case. In addition, the new function is short and simple enough that it can easily be verified to be correct; encode_string_utf_8 is big and complex.

Well, it is always easier to special-case some use case, but we have
general APIs for a reason.  In particular, having several similar but
subtly different functions is confusing and causes mistakes.

And you seem to be saying that encode_string_utf_8 doesn't work as
advertised, which means it should be fixed.

So I would prefer to use encode_string_utf_8 if reasonably practical.

Thanks.





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-17 17:05     ` Eli Zaretskii
@ 2020-08-17 18:48       ` Mattias Engdegård
  2020-08-17 19:56         ` Alan Third
  0 siblings, 1 reply; 29+ messages in thread
From: Mattias Engdegård @ 2020-08-17 18:48 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 42904, alan

17 aug. 2020 kl. 19.05 skrev Eli Zaretskii <eliz@gnu.org>:

> Well, it is always easier to special-case some use case, but we have
> general APIs for a reason.  In particular, having several similar but
> subtly different functions is confusing and causes mistakes.

The new function is much simpler and easier to use than encode_string_utf_8 precisely for that reason: to avoid confusion and mistakes, both of which I got in spades when trying to use it.

> And you seem to be saying that encode_string_utf_8 doesn't work as
> advertised, which means it should be fixed.

Actually I don't know exactly how it is supposed to work so I wouldn't even say that. It's probably fine code but it's not for me, not in this case.

> So I would prefer to use encode_string_utf_8 if reasonably practical.

Well, it doesn't seem to be reasonably practical. In order to fix a bug, I prefer not having to fix some unrelated but complex code, especially when it is unclear how and if that code really can and/or should be 'fixed', and exactly what that would entail.

Now if, after the proposed patch has been applied, someone wants to refactor so that string_to_valid_utf_8 disappears or becomes implemented in terms of something else, then that's perfectly fine, as long as the bug remains fixed.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-17 18:48       ` Mattias Engdegård
@ 2020-08-17 19:56         ` Alan Third
  2020-08-18  8:07           ` Mattias Engdegård
  0 siblings, 1 reply; 29+ messages in thread
From: Alan Third @ 2020-08-17 19:56 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 42904

On Mon, Aug 17, 2020 at 08:48:08PM +0200, Mattias Engdegård wrote:
> 17 aug. 2020 kl. 19.05 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > Well, it is always easier to special-case some use case, but we have
> > general APIs for a reason.  In particular, having several similar but
> > subtly different functions is confusing and causes mistakes.
> 
> The new function is much simpler and easier to use than
> encode_string_utf_8 precisely for that reason: to avoid confusion
> and mistakes, both of which I got in spades when trying to use it.

Sorry if this is a stupid question, but would using UTF-16 be easier?
This appears to work (although I'm sure it's not the right way to do this):

modified   src/nsfns.m
@@ -405,11 +405,10 @@ Turn the input menu (an NSMenu) into a lisp list for tracking on lisp side.
   NSString *str;
   NSView *view = FRAME_NS_VIEW (f);
 
+  encoded_name = code_convert_string_norecord (name, Qutf_16le, 1);
 
-  encoded_name = ENCODE_UTF_8 (name);
-
-  str = [NSString stringWithUTF8String: SSDATA (encoded_name)];
-
+  str = [NSString stringWithCharacters: (const unichar *) SDATA (encoded_name)
+                                length: SBYTES (encoded_name) / sizeof (unichar)];
 
   /* Don't change the name if it's already NAME.  */
   if (! [[[view window] title] isEqualToString: str])

-- 
Alan Third





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-17 19:56         ` Alan Third
@ 2020-08-18  8:07           ` Mattias Engdegård
  2020-08-18  8:43             ` Alan Third
  0 siblings, 1 reply; 29+ messages in thread
From: Mattias Engdegård @ 2020-08-18  8:07 UTC (permalink / raw)
  To: Alan Third; +Cc: 42904

17 aug. 2020 kl. 21.56 skrev Alan Third <alan@idiocy.org>:

> Sorry if this is a stupid question, but would using UTF-16 be easier?
> This appears to work (although I'm sure it's not the right way to do this):

A good question! It has the advantage of requiring no new code, but is slightly inferior in that raw bytes are not replaced with U+FFFD but with spaces; we should probably set :default-char to #xfffd for the utf-16 coding systems.

Unpaired surrogates are handled splendidly by accident: the conversion to UTF-16 preserves them, perhaps incorrectly so,  but the NS libs display them as a distinctive and very suggestive glyph. Apparently [NSString stringWithCharacters:] doesn't perform any validation at all.

On the other hand, I think we still do need a subroutine for converting to UTF-8 for passing strings to system code where graceful handling of invalid encoding cannot be assumed, as there appears to be nothing in Emacs that can do this.

> +  encoded_name = code_convert_string_norecord (name, Qutf_16le, 1);

Presumably this should be utf_16be on big-endian platforms. We still support PowerPC macOS, don't we?

> +  str = [NSString stringWithCharacters: (const unichar *) SDATA (encoded_name)

Is SDATA guaranteed to be 16-bit aligned? Doesn't matter on x86 or PowerPC, but strictly speaking...

^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-18  8:07           ` Mattias Engdegård
@ 2020-08-18  8:43             ` Alan Third
  2020-08-18 11:48               ` Mattias Engdegård
  2020-08-18 12:24               ` Eli Zaretskii
  0 siblings, 2 replies; 29+ messages in thread
From: Alan Third @ 2020-08-18  8:43 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 42904

On Tue, Aug 18, 2020 at 10:07:27AM +0200, Mattias Engdegård wrote:
> 17 aug. 2020 kl. 21.56 skrev Alan Third <alan@idiocy.org>:
> 
> > +  encoded_name = code_convert_string_norecord (name, Qutf_16le, 1);
> 
> Presumably this should be utf_16be on big-endian platforms. We still support PowerPC macOS, don't we?

No, however I imagine we support GNUstep on big endian systems.

> > +  str = [NSString stringWithCharacters: (const unichar *) SDATA (encoded_name)
> 
> Is SDATA guaranteed to be 16-bit aligned? Doesn't matter on x86 or
> PowerPC, but strictly speaking...

I've no idea, I adapted the code from make_multibyte_string in
alloc.c, and one of it's callers (although I can't remember which
right now). I'm expecting Eli to appear and tell me this is the
entirely wrong way of doing this. ;)

Anyway, as I understand it the internal representation of NS strings
are UTF-16, so the conversion through UTF-8 seems a bit of a waste if
we can go direct.
-- 
Alan Third





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-18  8:43             ` Alan Third
@ 2020-08-18 11:48               ` Mattias Engdegård
  2020-08-18 12:22                 ` Eli Zaretskii
  2020-08-18 17:28                 ` Alan Third
  2020-08-18 12:24               ` Eli Zaretskii
  1 sibling, 2 replies; 29+ messages in thread
From: Mattias Engdegård @ 2020-08-18 11:48 UTC (permalink / raw)
  To: Alan Third; +Cc: 42904

[-- Attachment #1: Type: text/plain, Size: 1291 bytes --]

18 aug. 2020 kl. 10.43 skrev Alan Third <alan@idiocy.org>:

>> Presumably this should be utf_16be on big-endian platforms. We still support PowerPC macOS, don't we?
> 
> No, however I imagine we support GNUstep on big endian systems.

Well then, it's easy to deal with.

>>> +  str = [NSString stringWithCharacters: (const unichar *) SDATA (encoded_name)
>> 
>> Is SDATA guaranteed to be 16-bit aligned? Doesn't matter on x86 or
>> PowerPC, but strictly speaking...
> 
> I've no idea, I adapted the code from make_multibyte_string in
> alloc.c, and one of it's callers (although I can't remember which
> right now). I'm expecting Eli to appear and tell me this is the
> entirely wrong way of doing this. ;)

Data alignment trapping is optional on 64-bit ARM but I'd be surprised if macOS enabled it. It might be hazardous for all the GNUStep-on-MIPS workstations.

> Anyway, as I understand it the internal representation of NS strings
> are UTF-16, so the conversion through UTF-8 seems a bit of a waste if
> we can go direct.

Maybe, but the conversion to UTF-16 then has to be done on the Emacs side instead, probably less efficiently than in the NS libs. It's probably a wash.

Anyway, here is an alternative patch using your method. Tell us what you think.


[-- Attachment #2: 0001-UTF-16-Fix-NS-crash-on-invalid-frame-title-string-bug-42904.patch --]
[-- Type: application/octet-stream, Size: 5209 bytes --]

From 9858f0409cc83eefcc153109e180c230868d20a5 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Tue, 18 Aug 2020 12:58:12 +0200
Subject: [PATCH] Fix NS crash on invalid frame title string (bug#42904)

Instead of blindly assuming that all Emacs strings are valid UTF-8,
which they are not, use UTF-16 because this conversion seems to be
more robust (after making sure that they use the correct replacement
character).  Unpaired surrogates will still go through to the NSString
objects, but the NS libs handle them gracefully.

* src/nsfns.m (string_to_nsstring): New function.
(ns_set_name_internal): Use string_to_nsstring.
* lisp/international/mule-conf.el (utf-16le, utf-16be)
(utf-16le-with-signature, utf-16be-with-signature, utf-16):
Use U+FFFD as :default-char instead of ASCII space.
* test/src/coding-tests.el (coding-utf-16-replacement-char): New test.
---
 lisp/international/mule-conf.el |  5 +++++
 src/nsfns.m                     | 34 +++++++++++++++++++--------------
 test/src/coding-tests.el        | 12 ++++++++++++
 3 files changed, 37 insertions(+), 14 deletions(-)

diff --git a/lisp/international/mule-conf.el b/lisp/international/mule-conf.el
index edda79ba4e..b9acafc158 100644
--- a/lisp/international/mule-conf.el
+++ b/lisp/international/mule-conf.el
@@ -1336,6 +1336,7 @@ 'utf-16le
   :mnemonic ?U
   :charset-list '(unicode)
   :endian 'little
+  :default-char #xfffd
   :mime-text-unsuitable t
   :mime-charset 'utf-16le)
 
@@ -1345,6 +1346,7 @@ 'utf-16be
   :mnemonic ?U
   :charset-list '(unicode)
   :endian 'big
+  :default-char #xfffd
   :mime-text-unsuitable t
   :mime-charset 'utf-16be)
 
@@ -1355,6 +1357,7 @@ 'utf-16le-with-signature
   :charset-list '(unicode)
   :bom t
   :endian 'little
+  :default-char #xfffd
   :mime-text-unsuitable t
   :mime-charset 'utf-16)
 
@@ -1365,6 +1368,7 @@ 'utf-16be-with-signature
   :charset-list '(unicode)
   :bom t
   :endian 'big
+  :default-char #xfffd
   :mime-text-unsuitable t
   :mime-charset 'utf-16)
 
@@ -1375,6 +1379,7 @@ 'utf-16
   :charset-list '(unicode)
   :bom '(utf-16le-with-signature . utf-16be-with-signature)
   :endian 'big
+  :default-char #xfffd
   :mime-text-unsuitable t
   :mime-charset 'utf-16)
 
diff --git a/src/nsfns.m b/src/nsfns.m
index 628233ea0d..e514585dc1 100644
--- a/src/nsfns.m
+++ b/src/nsfns.m
@@ -398,29 +398,35 @@ Turn the input menu (an NSMenu) into a lisp list for tracking on lisp side.
         [NSString stringWithUTF8String: SSDATA (arg)]];
 }
 
+static NSString *
+string_to_nsstring (Lisp_Object string)
+{
+  Lisp_Object coding =
+#ifdef WORDS_BIGENDIAN
+    Qutf_16be
+#else
+    Qutf_16le
+#endif
+    ;
+  Lisp_Object utf16 = code_convert_string_norecord (string, coding, true);
+  /* Here we somewhat precariously assume that SDATA is 16-bit aligned,
+     or that unaligned access is OK.  */
+  return [NSString stringWithCharacters: (const unichar *) SDATA (utf16)
+                                 length: SBYTES (utf16) / sizeof (unichar)];
+}
+
 static void
 ns_set_name_internal (struct frame *f, Lisp_Object name)
 {
-  Lisp_Object encoded_name, encoded_icon_name;
-  NSString *str;
   NSView *view = FRAME_NS_VIEW (f);
-
-
-  encoded_name = ENCODE_UTF_8 (name);
-
-  str = [NSString stringWithUTF8String: SSDATA (encoded_name)];
-
+  NSString *str = string_to_nsstring (name);
 
   /* Don't change the name if it's already NAME.  */
   if (! [[[view window] title] isEqualToString: str])
     [[view window] setTitle: str];
 
-  if (!STRINGP (f->icon_name))
-    encoded_icon_name = encoded_name;
-  else
-    encoded_icon_name = ENCODE_UTF_8 (f->icon_name);
-
-  str = [NSString stringWithUTF8String: SSDATA (encoded_icon_name)];
+  if (STRINGP (f->icon_name))
+    str = string_to_nsstring (f->icon_name);
 
   if ([[view window] miniwindowTitle]
       && ! [[[view window] miniwindowTitle] isEqualToString: str])
diff --git a/test/src/coding-tests.el b/test/src/coding-tests.el
index c438ae22ce..8b0adf0ad8 100644
--- a/test/src/coding-tests.el
+++ b/test/src/coding-tests.el
@@ -429,6 +429,18 @@ coding-check-coding-systems-region
                  '((iso-latin-1 3) (us-ascii 1 3))))
   (should-error (check-coding-systems-region "å" nil '(bad-coding-system))))
 
+(ert-deftest coding-utf-16-replacement-char ()
+  (should (equal (encode-coding-string "A\351B" 'utf-16be)
+                 (unibyte-string 0 ?A #xff #xfd 0 ?B)))
+  (should (equal (encode-coding-string "A\351B" 'utf-16le)
+                 (unibyte-string ?A 0 #xfd #xff ?B 0)))
+  (should (equal (encode-coding-string "A\ud8b6BΣ\227D𝄞" 'utf-16be)
+                 (unibyte-string 0 ?A #xd8 #xb6 0 ?B #x03 #xa3 #xff #xfd 0 ?D
+                                 #xd8 #x34 #xdd #x1e)))
+  (should (equal (encode-coding-string "A\ud8b6BΣ\227D𝄞" 'utf-16le)
+                 (unibyte-string ?A 0 #xb6 #xd8 ?B 0 #xa3 #x03 #xfd #xff ?D 0
+                                 #x34 #xd8 #x1e #xdd))))
+
 ;; Local Variables:
 ;; byte-compile-warnings: (not obsolete)
 ;; End:
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-18 11:48               ` Mattias Engdegård
@ 2020-08-18 12:22                 ` Eli Zaretskii
  2020-08-18 17:28                 ` Alan Third
  1 sibling, 0 replies; 29+ messages in thread
From: Eli Zaretskii @ 2020-08-18 12:22 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 42904, alan

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Tue, 18 Aug 2020 13:48:10 +0200
> Cc: Eli Zaretskii <eliz@gnu.org>, 42904@debbugs.gnu.org
> 
> Anyway, here is an alternative patch using your method. Tell us what you think.

Thanks, but I don't think we should modify the :default-char attribute
of the UTF-* encodings as part of this change.  It's a separate issue,
and is a backward-incompatible change of sorts.  For instance, we
should consider what this will do to display on TTY frames that don't
support Unicode.  So I think we should discuss this issue separately
before we make such a change.

Why is it a problem to display a space instead of invalid bytes in
this case?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-18  8:43             ` Alan Third
  2020-08-18 11:48               ` Mattias Engdegård
@ 2020-08-18 12:24               ` Eli Zaretskii
  2020-08-18 14:11                 ` Mattias Engdegård
  1 sibling, 1 reply; 29+ messages in thread
From: Eli Zaretskii @ 2020-08-18 12:24 UTC (permalink / raw)
  To: Alan Third; +Cc: 42904, mattiase

> Date: Tue, 18 Aug 2020 10:43:10 +0200 (CEST)
> From: Alan Third <alan@idiocy.org>
> Cc: Eli Zaretskii <eliz@gnu.org>, 42904@debbugs.gnu.org
> 
> > Is SDATA guaranteed to be 16-bit aligned? Doesn't matter on x86 or
> > PowerPC, but strictly speaking...
> 
> I've no idea, I adapted the code from make_multibyte_string in
> alloc.c, and one of it's callers (although I can't remember which
> right now). I'm expecting Eli to appear and tell me this is the
> entirely wrong way of doing this. ;)

It isn't wrong (and there's no need to worry about alignment in this
case, AFAIK).

Thanks.





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-18 12:24               ` Eli Zaretskii
@ 2020-08-18 14:11                 ` Mattias Engdegård
  2020-08-18 14:40                   ` Eli Zaretskii
  0 siblings, 1 reply; 29+ messages in thread
From: Mattias Engdegård @ 2020-08-18 14:11 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 42904, Alan Third

18 aug. 2020 kl. 14.24 skrev Eli Zaretskii <eliz@gnu.org>:

> It isn't wrong (and there's no need to worry about alignment in this
> case, AFAIK).

Do you mean that SDATA is guaranteed to be aligned, or that no NS platforms that Emacs runs on (or is likely to run on in the near future, such as macOS on arm64) trap on unaligned?

> Thanks, but I don't think we should modify the :default-char attribute
> of the UTF-* encodings as part of this change.  It's a separate issue,
> and is a backward-incompatible change of sorts.  For instance, we
> should consider what this will do to display on TTY frames that don't
> support Unicode.  So I think we should discuss this issue separately
> before we make such a change.

Yes, we can certainly make it a separate change. All bug fixes are backward-incompatible in some respect; it is not reasonable to depend on non-Unicode characters being translated to spaces when converted to UTF-16 since that is neither documented nor reasonably expected behaviour.

> Why is it a problem to display a space instead of invalid bytes in
> this case?

A problem is not necessary for a change to be desirable. The Unicode replacement character clearly indicates that something could not be encoded correctly, and the exact position for it; it's universally recognised and valuable for users and developers alike. Space is the default value for :default-char, and that it isn't U+FFFD for UTF-16 (or other Unicode encodings) is a clear bug, since that is the correct character to use for that purpose.

My guess is that space was chosen as default because it's a character that occurs in all coding systems, but it is clearly wrong for UTF-16. 'us-ascii' uses '?' for :default-char, which is a better character in that repertoire.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-18 14:11                 ` Mattias Engdegård
@ 2020-08-18 14:40                   ` Eli Zaretskii
  2020-08-18 15:21                     ` Mattias Engdegård
  0 siblings, 1 reply; 29+ messages in thread
From: Eli Zaretskii @ 2020-08-18 14:40 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 42904, alan

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Tue, 18 Aug 2020 16:11:02 +0200
> Cc: Alan Third <alan@idiocy.org>, 42904@debbugs.gnu.org
> 
> 18 aug. 2020 kl. 14.24 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > It isn't wrong (and there's no need to worry about alignment in this
> > case, AFAIK).
> 
> Do you mean that SDATA is guaranteed to be aligned, or that no NS platforms that Emacs runs on (or is likely to run on in the near future, such as macOS on arm64) trap on unaligned?

Both, AFAIK.

> it is not reasonable to depend on non-Unicode characters being translated to spaces

We are not the only program which does that, but.

> > Why is it a problem to display a space instead of invalid bytes in
> > this case?
> 
> A problem is not necessary for a change to be desirable. The Unicode replacement character clearly indicates that something could not be encoded correctly, and the exact position for it

But that character only makes sense when it can be displayed, because
otherwise no one will realize what was the problem.

Anyway, this discussion should be on emacs-devel, not as part of an
unrelated bug report.





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-18 14:40                   ` Eli Zaretskii
@ 2020-08-18 15:21                     ` Mattias Engdegård
  0 siblings, 0 replies; 29+ messages in thread
From: Mattias Engdegård @ 2020-08-18 15:21 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 42904, alan

18 aug. 2020 kl. 16.40 skrev Eli Zaretskii <eliz@gnu.org>:

>>> It isn't wrong (and there's no need to worry about alignment in this
>>> case, AFAIK).
>> 
>> Do you mean that SDATA is guaranteed to be aligned, or that no NS platforms that Emacs runs on (or is likely to run on in the near future, such as macOS on arm64) trap on unaligned?
> 
> Both, AFAIK.

Thank you. I'm still wary about making such assumptions but I suppose we commit worse sins.

> But that character only makes sense when it can be displayed, because
> otherwise no one will realize what was the problem.

Certainly. Given that TTYs aren't typically used for displaying UTF-16, and that the Unicode-capable terminals I've tried seem to show just fine (in case someone converts the UTF-16 back to UTF-8), I think we are reasonably safe.
In any case it's no different from not being able to show an any other character.

> Anyway, this discussion should be on emacs-devel, not as part of an
> unrelated bug report.

Not unrelated at all, but by all means.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-18 11:48               ` Mattias Engdegård
  2020-08-18 12:22                 ` Eli Zaretskii
@ 2020-08-18 17:28                 ` Alan Third
  2020-08-20  9:27                   ` Mattias Engdegård
  1 sibling, 1 reply; 29+ messages in thread
From: Alan Third @ 2020-08-18 17:28 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 42904

On Tue, Aug 18, 2020 at 01:48:10PM +0200, Mattias Engdegård wrote:
> 18 aug. 2020 kl. 10.43 skrev Alan Third <alan@idiocy.org>:
> > Anyway, as I understand it the internal representation of NS strings
> > are UTF-16, so the conversion through UTF-8 seems a bit of a waste if
> > we can go direct.
> 
> Maybe, but the conversion to UTF-16 then has to be done on the Emacs
> side instead, probably less efficiently than in the NS libs. It's
> probably a wash.
> 
> Anyway, here is an alternative patch using your method. Tell us what
> you think.

Looks good to me. The only thought I have is that perhaps we should
consider extending NSString to handle these lisp strings rather than
making it a separate function? We could provide a method to convert to
a lisp string as well, although that's not as complex.

I believe using categories would do it without us having to create a
new EmacsString class or similar.

I don't know if this is worth it because I don't know if we really
need these clean conversions elsewhere, but the neatness of

    newStr = [NSString withLispObject:str];

appeals. :)
-- 
Alan Third





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-18 17:28                 ` Alan Third
@ 2020-08-20  9:27                   ` Mattias Engdegård
  2020-08-20 13:24                     ` Eli Zaretskii
  2020-08-20 13:24                     ` Alan Third
  0 siblings, 2 replies; 29+ messages in thread
From: Mattias Engdegård @ 2020-08-20  9:27 UTC (permalink / raw)
  To: Alan Third; +Cc: 42904

[-- Attachment #1: Type: text/plain, Size: 1329 bytes --]

18 aug. 2020 kl. 19.28 skrev Alan Third <alan@idiocy.org>:

> Looks good to me. The only thought I have is that perhaps we should
> consider extending NSString to handle these lisp strings rather than
> making it a separate function? We could provide a method to convert to
> a lisp string as well, although that's not as complex.
> 
> I believe using categories would do it without us having to create a
> new EmacsString class or similar.

Fun, I hadn't done that before! Of course we should.

As it happens I just enjoyed the HOPL paper about the history of Objective-C (https://dl.acm.org/doi/10.1145/3386332). An excellent read in general, and it has some history about how the categories came about.

Here is an updated patch: it is now self-contained and does not change anything outside the NS backend.

There is a minor imperfection: the incoming name string can actually be miscoded if it contains both non-ASCII characters and raw bytes. As an example, consider

 (rename-buffer "aéb\300")

In xdisp.c:12497, the Lisp name string is created using make_string which decides that the above multibyte string should really be unibyte, and that confuses the converter. It is of no great consequence, but it makes the result look messier than it should have: "a��b��c" instead of "aéb�c".


[-- Attachment #2: 0001-Fix-NS-crash-on-invalid-frame-title-string-bug-42904.patch --]
[-- Type: application/octet-stream, Size: 4140 bytes --]

From 18c2f2d8aee5ca6708232477d12dc1d884c75235 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Tue, 18 Aug 2020 12:58:12 +0200
Subject: [PATCH] Fix NS crash on invalid frame title string (bug#42904)

Instead of blindly assuming that all Emacs strings are valid UTF-8,
which they are not, use a more careful conversion going via UTF-16
which is what NSString uses internally.  Unpaired surrogates will
still go through to the NSString objects, but the NS libs handle them
gracefully.

* src/nsterm.h (EmacsString): New category.
* src/nsfns.m (all_nonzero_ascii): New helper function.
([NSString stringWithLispString:]): New method.
(ns_set_name_internal): Use new conversion method.
---
 src/nsfns.m  | 65 +++++++++++++++++++++++++++++++++++++++++-----------
 src/nsterm.h |  5 ++++
 2 files changed, 56 insertions(+), 14 deletions(-)

diff --git a/src/nsfns.m b/src/nsfns.m
index 628233ea0d..8357b8c290 100644
--- a/src/nsfns.m
+++ b/src/nsfns.m
@@ -398,29 +398,66 @@ Turn the input menu (an NSMenu) into a lisp list for tracking on lisp side.
         [NSString stringWithUTF8String: SSDATA (arg)]];
 }
 
+/* Whether N bytes at STR are in the [0,127] range.  */
+static bool
+all_nonzero_ascii (unsigned char *str, ptrdiff_t n)
+{
+  for (ptrdiff_t i = 0; i < n; i++)
+    if (str[i] < 1 || str[i] > 127)
+      return false;
+  return true;
+}
+
+@implementation NSString (EmacsString)
+/* Make an NSString from a Lisp string.  */
++ (NSString *)stringWithLispString:(Lisp_Object)string
+{
+  /* Shortcut for the common case.  */
+  if (all_nonzero_ascii (SDATA (string), SBYTES (string)))
+    return [NSString stringWithCString: SSDATA (string)
+                              encoding: NSASCIIStringEncoding];
+  string = string_to_multibyte (string);
+
+  /* Now the string is multibyte; convert to UTF-16.  */
+  unichar *chars = xmalloc (4 * SCHARS (string));
+  unichar *d = chars;
+  const unsigned char *s = SDATA (string);
+  const unsigned char *end = s + SBYTES (string);
+  while (s < end)
+    {
+      int c = string_char_advance (&s);
+      /* We pass unpaired surrogates through, because they are typically
+         handled fairly well by the NS libraries (displayed with distinct
+         glyphs etc).  */
+      if (c <= 0xffff)
+        *d++ = c;
+      else if (c <= 0x10ffff)
+        {
+          *d++ = 0xd800 + (c & 0x3ff);
+          *d++ = 0xdc00 + ((c - 0x10000) >> 10);
+        }
+      else
+        *d++ = 0xfffd;          /* Not valid for UTF-16.  */
+    }
+  NSString *str = [NSString stringWithCharacters: chars
+                                          length: d - chars];
+  xfree (chars);
+  return str;
+}
+@end
+
 static void
 ns_set_name_internal (struct frame *f, Lisp_Object name)
 {
-  Lisp_Object encoded_name, encoded_icon_name;
-  NSString *str;
   NSView *view = FRAME_NS_VIEW (f);
-
-
-  encoded_name = ENCODE_UTF_8 (name);
-
-  str = [NSString stringWithUTF8String: SSDATA (encoded_name)];
-
+  NSString *str = [NSString stringWithLispString: name];
 
   /* Don't change the name if it's already NAME.  */
   if (! [[[view window] title] isEqualToString: str])
     [[view window] setTitle: str];
 
-  if (!STRINGP (f->icon_name))
-    encoded_icon_name = encoded_name;
-  else
-    encoded_icon_name = ENCODE_UTF_8 (f->icon_name);
-
-  str = [NSString stringWithUTF8String: SSDATA (encoded_icon_name)];
+  if (STRINGP (f->icon_name))
+    str = [NSString stringWithLispString: f->icon_name];
 
   if ([[view window] miniwindowTitle]
       && ! [[[view window] miniwindowTitle] isEqualToString: str])
diff --git a/src/nsterm.h b/src/nsterm.h
index a511fef5b9..ab868ed344 100644
--- a/src/nsterm.h
+++ b/src/nsterm.h
@@ -361,6 +361,11 @@ #define NS_DRAW_TO_BUFFER 1
 
 @end
 
+
+@interface NSString (EmacsString)
++ (NSString *)stringWithLispString:(Lisp_Object)string;
+@end
+
 /* ==========================================================================
 
    The Emacs application
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-20  9:27                   ` Mattias Engdegård
@ 2020-08-20 13:24                     ` Eli Zaretskii
  2020-08-20 18:46                       ` Mattias Engdegård
  2020-08-20 13:24                     ` Alan Third
  1 sibling, 1 reply; 29+ messages in thread
From: Eli Zaretskii @ 2020-08-20 13:24 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 42904, alan

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Thu, 20 Aug 2020 11:27:01 +0200
> Cc: Eli Zaretskii <eliz@gnu.org>, 42904@debbugs.gnu.org
> 
> There is a minor imperfection: the incoming name string can actually be miscoded if it contains both non-ASCII characters and raw bytes. As an example, consider
> 
>  (rename-buffer "aéb\300")
> 
> In xdisp.c:12497, the Lisp name string is created using make_string which decides that the above multibyte string should really be unibyte, and that confuses the converter. It is of no great consequence, but it makes the result look messier than it should have: "a��b��c" instead of "aéb�c".

What would you like xdisp.c to do instead in this case?  If there's an
alternative way of dealing with such frame titles that is better in
some sense, we could either adopt it for all platforms, or only for
NS.





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-20  9:27                   ` Mattias Engdegård
  2020-08-20 13:24                     ` Eli Zaretskii
@ 2020-08-20 13:24                     ` Alan Third
  2020-08-20 17:44                       ` Mattias Engdegård
  1 sibling, 1 reply; 29+ messages in thread
From: Alan Third @ 2020-08-20 13:24 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 42904

On Thu, Aug 20, 2020 at 11:27:01AM +0200, Mattias Engdegård wrote:
> 18 aug. 2020 kl. 19.28 skrev Alan Third <alan@idiocy.org>:
> 
> > Looks good to me. The only thought I have is that perhaps we should
> > consider extending NSString to handle these lisp strings rather than
> > making it a separate function? We could provide a method to convert to
> > a lisp string as well, although that's not as complex.
> > 
> > I believe using categories would do it without us having to create a
> > new EmacsString class or similar.
> 
> Fun, I hadn't done that before! Of course we should.
> 
> As it happens I just enjoyed the HOPL paper about the history of
> Objective-C (https://dl.acm.org/doi/10.1145/3386332). An excellent
> read in general, and it has some history about how the categories
> came about.

I haven't seen that before and am just reading through it now. Thanks.

> Here is an updated patch: it is now self-contained and does not
> change anything outside the NS backend.

It looks good to me. The only thing I'd like you to change is to move
the implementation down to the "Class implementations" part of
nsfns.m.

> There is a minor imperfection: the incoming name string can actually
> be miscoded if it contains both non-ASCII characters and raw bytes.
> As an example, consider
> 
>  (rename-buffer "aéb\300")
> 
> In xdisp.c:12497, the Lisp name string is created using make_string
> which decides that the above multibyte string should really be
> unibyte, and that confuses the converter. It is of no great
> consequence, but it makes the result look messier than it should
> have: "a��b��c" instead of "aéb�c".

I think we can live with that, it's definitely better than a crash and
seems reasonable given that the input is junk.

Thanks!
-- 
Alan Third





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-20 13:24                     ` Alan Third
@ 2020-08-20 17:44                       ` Mattias Engdegård
  0 siblings, 0 replies; 29+ messages in thread
From: Mattias Engdegård @ 2020-08-20 17:44 UTC (permalink / raw)
  To: Alan Third; +Cc: 42904-done

20 aug. 2020 kl. 15.24 skrev Alan Third <alan@idiocy.org>:

> It looks good to me. The only thing I'd like you to change is to move
> the implementation down to the "Class implementations" part of
> nsfns.m.

Moved, and pushed. Thank you!






^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-20 13:24                     ` Eli Zaretskii
@ 2020-08-20 18:46                       ` Mattias Engdegård
  2020-08-20 19:13                         ` Eli Zaretskii
  0 siblings, 1 reply; 29+ messages in thread
From: Mattias Engdegård @ 2020-08-20 18:46 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 42904, alan

20 aug. 2020 kl. 15.24 skrev Eli Zaretskii <eliz@gnu.org>:

> What would you like xdisp.c to do instead in this case?  If there's an
> alternative way of dealing with such frame titles that is better in
> some sense, we could either adopt it for all platforms, or only for
> NS.

Not sure how to deal with it, but maybe it's just a matter of settling on multibyte representation when building the title (as in mode_line_noprop_buf and so on)? I presume that the current ambiguity comes from when there were good reasons to build these strings in various unibyte encodings, but maybe it isn't motivated today?

If it is at all any trouble at all, just leave it as it is. On the other hand, perhaps we have found a way to simplify the code by accident.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-20 18:46                       ` Mattias Engdegård
@ 2020-08-20 19:13                         ` Eli Zaretskii
  2020-08-21  9:39                           ` Mattias Engdegård
  0 siblings, 1 reply; 29+ messages in thread
From: Eli Zaretskii @ 2020-08-20 19:13 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 42904, alan

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Thu, 20 Aug 2020 20:46:17 +0200
> Cc: alan@idiocy.org, 42904@debbugs.gnu.org
> 
> 20 aug. 2020 kl. 15.24 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > What would you like xdisp.c to do instead in this case?  If there's an
> > alternative way of dealing with such frame titles that is better in
> > some sense, we could either adopt it for all platforms, or only for
> > NS.
> 
> Not sure how to deal with it, but maybe it's just a matter of settling on multibyte representation when building the title (as in mode_line_noprop_buf and so on)?

I don't think I understand.  mode_line_noprop_buf gets the bytes, and
then we call make_string on it, so the result is the same as the one
you'd like to avoid.  Or am I missing something?

By "settling on multibyte representation", do you mean that we should
convert raw bytes to their multibyte form?  Or do you mean something
else?

> I presume that the current ambiguity comes from when there were good reasons to build these strings in various unibyte encodings, but maybe it isn't motivated today?

Again, what would you like to have instead?  Would calling
str_as_multibyte do what you want?

The reason we build a unibyte string is that the presence of raw bytes
generally means a unibyte string is desired; it's a heuristic.  It is
also the simplest thing to do in this case, and always works because
it doesn't change the byte sequence of the original string.

> If it is at all any trouble at all, just leave it as it is. On the other hand, perhaps we have found a way to simplify the code by accident.

See above: maybe str_as_multibyte is what you want?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-20 19:13                         ` Eli Zaretskii
@ 2020-08-21  9:39                           ` Mattias Engdegård
  2020-08-21 13:26                             ` Eli Zaretskii
  0 siblings, 1 reply; 29+ messages in thread
From: Mattias Engdegård @ 2020-08-21  9:39 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 42904, alan

20 aug. 2020 kl. 21.13 skrev Eli Zaretskii <eliz@gnu.org>:

> I don't think I understand.  mode_line_noprop_buf gets the bytes, and
> then we call make_string on it, so the result is the same as the one
> you'd like to avoid.  Or am I missing something?
> 
> By "settling on multibyte representation", do you mean that we should
> convert raw bytes to their multibyte form?  Or do you mean something
> else?

No, I think we are talking about the same thing. Basically, it's about how the bytes end up in mode_line_noprop_buf in the first place, since currently the information of whether it should be interpreted as unibyte or multibyte gets lost as soon as data from the strings it is composed of (like the buffer name for %b, file name for %f etc) is added to it. Then make_string tries to restore that information by looking at the bytes, and it is not always accurate.

One way of doing this is to always make sure that the input strings (buffer name, file name, frame-title-format etc) are always in multibyte form. Another would be to convert to multibyte as those strings are used, presumably in decode_mode_spec. You know this code a lot better than I do, but the former may be slightly more workable (and efficient).

> Again, what would you like to have instead?  Would calling
> str_as_multibyte do what you want?

No, I don't think so -- once the unibyte/multibyte bit is lost, it can only be restored imperfectly if all we have is the sequence of bytes. In mathematical terms, the function that maps an arbitrary string object to its bytes has no inverse. (Consider the unibyte string "\xc3\xa5" -- should the bytes {c3, a5} be recreated as that unibyte string, or as the multibyte string "å"?)

Again we are talking about trivialities here, but perhaps the same syndrome will arise in other contexts where it matters more. If we wrote Emacs from scratch we likely wouldn't have unibyte strings at all: they are only there for compatibility and various niche uses and performance hacks. I don't think it's unreasonable to start normalising strings to multibyte where it matters.

Thanks for your patience!

^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-21  9:39                           ` Mattias Engdegård
@ 2020-08-21 13:26                             ` Eli Zaretskii
  2020-08-21 14:53                               ` Mattias Engdegård
  0 siblings, 1 reply; 29+ messages in thread
From: Eli Zaretskii @ 2020-08-21 13:26 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 42904, alan

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Fri, 21 Aug 2020 11:39:30 +0200
> Cc: alan@idiocy.org, 42904@debbugs.gnu.org
> 
> Basically, it's about how the bytes end up in mode_line_noprop_buf in the first place, since currently the information of whether it should be interpreted as unibyte or multibyte gets lost as soon as data from the strings it is composed of (like the buffer name for %b, file name for %f etc) is added to it. Then make_string tries to restore that information by looking at the bytes, and it is not always accurate.

make_string was written to work on byte sequences that don't begin as
the payload of a Lisp string.  So it doesn't handle the information
you say is being lost, because it doesn't expect such information to
be available to begin with.

Which is basically just another way of saying "you want something
other than make_string" here.

> One way of doing this is to always make sure that the input strings (buffer name, file name, frame-title-format etc) are always in multibyte form.

That's what I thought I was suggesting.

> > Again, what would you like to have instead?  Would calling
> > str_as_multibyte do what you want?
> 
> No, I don't think so -- once the unibyte/multibyte bit is lost, it can only be restored imperfectly if all we have is the sequence of bytes.

That is true, but str_as_multibyte simply interprets any valid UTF-8
sequence as a character, and any invalid sequence as a raw bytes.  I
thought this was precisely what you wanted for this use case, no?

> If we wrote Emacs from scratch we likely wouldn't have unibyte strings at all: they are only there for compatibility and various niche uses and performance hacks. I don't think it's unreasonable to start normalising strings to multibyte where it matters.

Emacs (as any other old editor) started with only unibyte strings, so
that's history for you.  Some modern text-handling environments solve
this conundrum by not supporting raw bytes at all, but Emacs knows
better.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-21 13:26                             ` Eli Zaretskii
@ 2020-08-21 14:53                               ` Mattias Engdegård
  2020-08-21 15:27                                 ` Eli Zaretskii
  0 siblings, 1 reply; 29+ messages in thread
From: Mattias Engdegård @ 2020-08-21 14:53 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 42904, alan

[-- Attachment #1: Type: text/plain, Size: 680 bytes --]

21 aug. 2020 kl. 15.26 skrev Eli Zaretskii <eliz@gnu.org>:

> That is true, but str_as_multibyte simply interprets any valid UTF-8
> sequence as a character, and any invalid sequence as a raw bytes.  I
> thought this was precisely what you wanted for this use case, no?

Sorry, I read the comment for that function and got the impression that it would interpret raw bytes as Latin-1. Fortunately that wasn't true, and using it seems to be a clear improvement. Now a mixture of non-ASCII and raw bytes, like "a\377büc" results in the title "a��büc", which is one � too many but good enough.

What about the attached patch then? Only tested on macOS, admittedly.


[-- Attachment #2: 0001-Always-make-a-multibyte-string-for-the-frame-title-b.patch --]
[-- Type: application/octet-stream, Size: 2000 bytes --]

From dbe9ee59a179a3d42f337c60b5f426f6ff2913ca Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Fri, 21 Aug 2020 16:09:04 +0200
Subject: [PATCH] Always make a multibyte string for the frame title
 (bug#42904)

* src/xdisp.c (gui_consider_frame_title): Multibyte-encode any raw
bytes in the title, and then pass a multibyte string to the back-end
for use as a frame title.  This cuts down a little on the rubbish
shown when raw bytes sneak in by mistake (as part of the buffer name,
for instance).
---
 src/xdisp.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/src/xdisp.c b/src/xdisp.c
index ad03ac4605..9eeae43a52 100644
--- a/src/xdisp.c
+++ b/src/xdisp.c
@@ -12482,6 +12482,11 @@ gui_consider_frame_title (Lisp_Object frame)
       display_mode_element (&it, 0, -1, -1, fmt, Qnil, false);
       len = MODE_LINE_NOPROP_LEN (title_start);
       title = mode_line_noprop_buf + title_start;
+      /* Make sure any raw bytes in the title are properly
+         multibyte-encoded.  */
+      ptrdiff_t nchars = 0;
+      len = str_as_multibyte (title, mode_line_noprop_buf_end - title,
+                              len, &nchars);
       unbind_to (count, Qnil);
 
       /* Set the title only if it's changed.  This avoids consing in
@@ -12493,9 +12498,10 @@ gui_consider_frame_title (Lisp_Object frame)
            || SBYTES (f->name) != len
            || memcmp (title, SDATA (f->name), len) != 0)
           && FRAME_TERMINAL (f)->implicit_set_name_hook)
-	FRAME_TERMINAL (f)->implicit_set_name_hook (f,
-                                                    make_string (title, len),
-                                                    Qnil);
+        {
+          Lisp_Object title_string = make_multibyte_string (title, nchars, len);
+          FRAME_TERMINAL (f)->implicit_set_name_hook (f, title_string, Qnil);
+        }
     }
 }
 
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-21 14:53                               ` Mattias Engdegård
@ 2020-08-21 15:27                                 ` Eli Zaretskii
  2020-08-21 15:50                                   ` Mattias Engdegård
  0 siblings, 1 reply; 29+ messages in thread
From: Eli Zaretskii @ 2020-08-21 15:27 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 42904, alan

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Fri, 21 Aug 2020 16:53:34 +0200
> Cc: alan@idiocy.org, 42904@debbugs.gnu.org
> 
> > That is true, but str_as_multibyte simply interprets any valid UTF-8
> > sequence as a character, and any invalid sequence as a raw bytes.  I
> > thought this was precisely what you wanted for this use case, no?
> 
> Sorry, I read the comment for that function and got the impression that it would interpret raw bytes as Latin-1.

That was a remnant from pre-Unicode Emacs; I've fixed the commentary
to accurately describe what happens now.

> Fortunately that wasn't true, and using it seems to be a clear improvement. Now a mixture of non-ASCII and raw bytes, like "a\377büc" results in the title "a��büc", which is one � too many but good enough.
> 
> What about the attached patch then? Only tested on macOS, admittedly.

It looks OK, but someone should see what it does on X before we make
this change on all platforms.  (On w32 frames, the display stops
before the first raw byte, but it also does that with the current
code.)  If the effect on X is for the worse, we will have to condition
this by HAVE_NS.

>        title = mode_line_noprop_buf + title_start;
> +      /* Make sure any raw bytes in the title are properly
> +         multibyte-encoded.  */

It is better not to use "encoded" when talking about internal
representation.  I'd say something like "represented by their
multibyte sequences" instead.

Thanks.





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-21 15:27                                 ` Eli Zaretskii
@ 2020-08-21 15:50                                   ` Mattias Engdegård
  2020-08-23 17:23                                     ` Mattias Engdegård
  0 siblings, 1 reply; 29+ messages in thread
From: Mattias Engdegård @ 2020-08-21 15:50 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 42904, alan

21 aug. 2020 kl. 17.27 skrev Eli Zaretskii <eliz@gnu.org>:

> That was a remnant from pre-Unicode Emacs; I've fixed the commentary
> to accurately describe what happens now.

Much appreciated.

> It looks OK, but someone should see what it does on X before we make
> this change on all platforms.  (On w32 frames, the display stops
> before the first raw byte, but it also does that with the current
> code.)  If the effect on X is for the worse, we will have to condition
> this by HAVE_NS.

I will be able to test on X in a few days.

> It is better not to use "encoded" when talking about internal
> representation.  I'd say something like "represented by their
> multibyte sequences" instead.

Thank you, the comment will be amended.






^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS
  2020-08-21 15:50                                   ` Mattias Engdegård
@ 2020-08-23 17:23                                     ` Mattias Engdegård
  0 siblings, 0 replies; 29+ messages in thread
From: Mattias Engdegård @ 2020-08-23 17:23 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 42904, alan

21 aug. 2020 kl. 17.50 skrev Mattias Engdegård <mattiase@acm.org>:

>> It looks OK, but someone should see what it does on X before we make
>> this change on all platforms.  (On w32 frames, the display stops
>> before the first raw byte, but it also does that with the current
>> code.)  If the effect on X is for the worse, we will have to condition
>> this by HAVE_NS.
> 
> I will be able to test on X in a few days.

Now tested on X with no regression observed, thus pushed to master. Thank you!






^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2020-08-23 17:23 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-17 14:11 bug#42904: [PATCH] Non-Unicode frame title crashes Emacs on macOS Mattias Engdegård
2020-08-17 14:54 ` Andrii Kolomoiets
2020-08-17 15:55   ` Mattias Engdegård
2020-08-17 15:55 ` Eli Zaretskii
2020-08-17 16:11   ` Mattias Engdegård
2020-08-17 17:05     ` Eli Zaretskii
2020-08-17 18:48       ` Mattias Engdegård
2020-08-17 19:56         ` Alan Third
2020-08-18  8:07           ` Mattias Engdegård
2020-08-18  8:43             ` Alan Third
2020-08-18 11:48               ` Mattias Engdegård
2020-08-18 12:22                 ` Eli Zaretskii
2020-08-18 17:28                 ` Alan Third
2020-08-20  9:27                   ` Mattias Engdegård
2020-08-20 13:24                     ` Eli Zaretskii
2020-08-20 18:46                       ` Mattias Engdegård
2020-08-20 19:13                         ` Eli Zaretskii
2020-08-21  9:39                           ` Mattias Engdegård
2020-08-21 13:26                             ` Eli Zaretskii
2020-08-21 14:53                               ` Mattias Engdegård
2020-08-21 15:27                                 ` Eli Zaretskii
2020-08-21 15:50                                   ` Mattias Engdegård
2020-08-23 17:23                                     ` Mattias Engdegård
2020-08-20 13:24                     ` Alan Third
2020-08-20 17:44                       ` Mattias Engdegård
2020-08-18 12:24               ` Eli Zaretskii
2020-08-18 14:11                 ` Mattias Engdegård
2020-08-18 14:40                   ` Eli Zaretskii
2020-08-18 15:21                     ` Mattias Engdegård

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).