Use the Unicode replacement character for replacing unencodable characters into UTF-16

unofficial mirror of emacs-devel@gnu.org 

* Use the Unicode replacement character for replacing unencodable characters into UTF-16
@ 2020-08-18 15:36 Mattias Engdegård
  2020-08-18 16:19 ` Eli Zaretskii
  0 siblings, 1 reply; 5+ messages in thread
From: Mattias Engdegård @ 2020-08-18 15:36 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 471 bytes --]

The attached patch makes sure that non-Unicode characters are replaced with U+FFFD REPLACEMENT CHARACTER instead of a space when converting to UTF-16. (By all evidence, the space is a historical accident.)

This change is required for one possible solution of bug#42904. We can do without this patch, but it fixes a clear bug.

For some reason, unpaired surrogates aren't affected despite not being encodable in UTF-16 -- another bug, but not one addressed here.
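
To see the difference concretely, the encoded bytes can be listed directly. This is a minimal sketch, not part of the patch: `my-encoded-bytes` is a hypothetical helper name, and the input string is borrowed from the patch's new test.

```elisp
;; Hypothetical helper for illustration; not part of the patch.
(defun my-encoded-bytes (string coding-system)
  "Return STRING encoded with CODING-SYSTEM as a list of byte values."
  (append (encode-coding-string string coding-system) nil))

;; "A\351B" contains the raw byte #xE9 between A and B.
(my-encoded-bytes "A\351B" 'utf-16be)
;; With the patch, the new test expects (0 65 255 253 0 66), i.e. U+FFFD
;; in the middle.  Without it, the middle code unit is a space: 0 32.
```

The same helper works for `utf-16le`, where each pair of bytes is swapped.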


[-- Attachment #2: 0001-Use-Unicode-replacement-character-for-unencodable-UT.patch --]
[-- Type: application/octet-stream, Size: 3094 bytes --]

From 28764d55bf06f2b81a33ea03258ba62b9c02a6b9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Tue, 18 Aug 2020 17:00:15 +0200
Subject: [PATCH] Use Unicode replacement character for unencodable UTF-16
 characters

Use the standard U+FFFD REPLACEMENT CHARACTER instead of a space to
replace characters that cannot be encoded in UTF-16.

* lisp/international/mule-conf.el (utf-16le, utf-16be)
(utf-16le-with-signature, utf-16be-with-signature, utf-16):
Use U+FFFD as :default-char.
* test/src/coding-tests.el (coding-utf-16-replacement-char): New test.
---
 lisp/international/mule-conf.el |  5 +++++
 test/src/coding-tests.el        | 12 ++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/lisp/international/mule-conf.el b/lisp/international/mule-conf.el
index edda79ba4e..b9acafc158 100644
--- a/lisp/international/mule-conf.el
+++ b/lisp/international/mule-conf.el
@@ -1336,6 +1336,7 @@ 'utf-16le
   :mnemonic ?U
   :charset-list '(unicode)
   :endian 'little
+  :default-char #xfffd
   :mime-text-unsuitable t
   :mime-charset 'utf-16le)
 
@@ -1345,6 +1346,7 @@ 'utf-16be
   :mnemonic ?U
   :charset-list '(unicode)
   :endian 'big
+  :default-char #xfffd
   :mime-text-unsuitable t
   :mime-charset 'utf-16be)
 
@@ -1355,6 +1357,7 @@ 'utf-16le-with-signature
   :charset-list '(unicode)
   :bom t
   :endian 'little
+  :default-char #xfffd
   :mime-text-unsuitable t
   :mime-charset 'utf-16)
 
@@ -1365,6 +1368,7 @@ 'utf-16be-with-signature
   :charset-list '(unicode)
   :bom t
   :endian 'big
+  :default-char #xfffd
   :mime-text-unsuitable t
   :mime-charset 'utf-16)
 
@@ -1375,6 +1379,7 @@ 'utf-16
   :charset-list '(unicode)
   :bom '(utf-16le-with-signature . utf-16be-with-signature)
   :endian 'big
+  :default-char #xfffd
   :mime-text-unsuitable t
   :mime-charset 'utf-16)
 
diff --git a/test/src/coding-tests.el b/test/src/coding-tests.el
index c438ae22ce..8b0adf0ad8 100644
--- a/test/src/coding-tests.el
+++ b/test/src/coding-tests.el
@@ -429,6 +429,18 @@ coding-check-coding-systems-region
                  '((iso-latin-1 3) (us-ascii 1 3))))
   (should-error (check-coding-systems-region "å" nil '(bad-coding-system))))
 
+(ert-deftest coding-utf-16-replacement-char ()
+  (should (equal (encode-coding-string "A\351B" 'utf-16be)
+                 (unibyte-string 0 ?A #xff #xfd 0 ?B)))
+  (should (equal (encode-coding-string "A\351B" 'utf-16le)
+                 (unibyte-string ?A 0 #xfd #xff ?B 0)))
+  (should (equal (encode-coding-string "A\ud8b6BΣ\227D𝄞" 'utf-16be)
+                 (unibyte-string 0 ?A #xd8 #xb6 0 ?B #x03 #xa3 #xff #xfd 0 ?D
+                                 #xd8 #x34 #xdd #x1e)))
+  (should (equal (encode-coding-string "A\ud8b6BΣ\227D𝄞" 'utf-16le)
+                 (unibyte-string ?A 0 #xb6 #xd8 ?B 0 #xa3 #x03 #xfd #xff ?D 0
+                                 #x34 #xd8 #x1e #xdd))))
+
 ;; Local Variables:
 ;; byte-compile-warnings: (not obsolete)
 ;; End:
-- 
2.21.1 (Apple Git-122.3)



* Re: Use the Unicode replacement character for replacing unencodable characters into UTF-16
  2020-08-18 15:36 Use the Unicode replacement character for replacing unencodable characters into UTF-16 Mattias Engdegård
@ 2020-08-18 16:19 ` Eli Zaretskii
  2020-08-18 17:07   ` Mattias Engdegård
  0 siblings, 1 reply; 5+ messages in thread
From: Eli Zaretskii @ 2020-08-18 16:19 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: emacs-devel

> Feedback-ID: mattiase@acm.or
> From: Mattias Engdegård <mattiase@acm.org>
> Date: Tue, 18 Aug 2020 17:36:10 +0200
> 
> The attached patch makes sure that non-Unicode characters are replaced with U+FFFD REPLACEMENT CHARACTER instead of a space when converting to UTF-16. (By all evidence, the space is a historical accident.)

Can you describe under which circumstances this default-character will
be used?

The issue that bothers me is whether U+FFFD can appear in situations
where it cannot be displayed by Emacs, because then the result will be
more confusing than helpful.




* Re: Use the Unicode replacement character for replacing unencodable characters into UTF-16
  2020-08-18 16:19 ` Eli Zaretskii
@ 2020-08-18 17:07   ` Mattias Engdegård
  2020-08-18 18:13     ` Eli Zaretskii
  0 siblings, 1 reply; 5+ messages in thread
From: Mattias Engdegård @ 2020-08-18 17:07 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

On 18 Aug 2020, at 18:19, Eli Zaretskii <eliz@gnu.org> wrote:

> Can you describe under which circumstances this default-character will
> be used?

It's what encoding into UTF-16 uses for characters that don't have a Unicode equivalent, such as raw bytes.

Now:

 (encode-coding-string "X\377Y" 'utf-16be)
 => "X Y" (in UTF-16-BE)

With the patch:

 (encode-coding-string "X\377Y" 'utf-16be)
 => "X\ufffdY" (in UTF-16-BE)

> The issue that bothers me is whether U+FFFD can appear in situations
> where it cannot be displayed by Emacs, because then the result will be
> more confusing than helpful.

Do you mean that on balance, all things considered, you prefer space as replacement character to U+FFFD?





* Re: Use the Unicode replacement character for replacing unencodable characters into UTF-16
  2020-08-18 17:07   ` Mattias Engdegård
@ 2020-08-18 18:13     ` Eli Zaretskii
  2020-08-18 19:43       ` Mattias Engdegård
  0 siblings, 1 reply; 5+ messages in thread
From: Eli Zaretskii @ 2020-08-18 18:13 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: emacs-devel

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Tue, 18 Aug 2020 19:07:41 +0200
> Cc: emacs-devel@gnu.org
> 
> On 18 Aug 2020, at 18:19, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> > Can you describe under which circumstances this default-character will
> > be used?
> 
> It's what encoding into UTF-16 uses for characters that don't have a Unicode equivalent, such as raw bytes.

My reading is that this happens only for codepoints beyond 0x10ffff.
Raw bytes end up there, but I'm not sure they always end up there.
Characters that aren't unified also end up there.

> > The issue that bothers me is whether U+FFFD can appear in situations
> > where it cannot be displayed by Emacs, because then the result will be
> > more confusing than helpful.
> 
> Do you mean that on balance, all things considered, you prefer space as replacement character to U+FFFD?

I mean that if the situations that bother me do in fact exist (I'm not
sure they do), we should discuss them and see whether we care about them.
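
The "beyond 0x10ffff" condition mentioned above can be checked from Lisp. This is a small sketch: it relies only on the fact that Emacs represents a raw byte in multibyte text as a character above the Unicode range, which is what `unibyte-char-to-multibyte` returns for bytes 128 to 255. `my-beyond-unicode-p` is a hypothetical name, not an Emacs function.

```elisp
;; Hypothetical helper for illustration only.
(defun my-beyond-unicode-p (char)
  "Return non-nil if CHAR lies above the Unicode range (#x10FFFF)."
  (> char #x10ffff))

(my-beyond-unicode-p ?A)                                ; => nil
(my-beyond-unicode-p (unibyte-char-to-multibyte #xe9))  ; => t
```

This only tests the numeric range; it does not say which non-unified characters fall there.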




* Re: Use the Unicode replacement character for replacing unencodable characters into UTF-16
  2020-08-18 18:13     ` Eli Zaretskii
@ 2020-08-18 19:43       ` Mattias Engdegård
  0 siblings, 0 replies; 5+ messages in thread
From: Mattias Engdegård @ 2020-08-18 19:43 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

On 18 Aug 2020, at 20:13, Eli Zaretskii <eliz@gnu.org> wrote:

> My reading is that this happens only for codepoints beyond 0x10ffff.
> Raw bytes end up there, but I'm not sure they always end up there.
> Characters that aren't unified also end up there.

As far as I can tell, raw bytes (in the [128,255] range) are always replaced, even when the input is a unibyte string.
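
One way to check this claim is to encode the same text once as a unibyte string and once after conversion with `string-to-multibyte`, then compare the results. This is a sketch only: `my-compare-raw-byte-encoding` is a hypothetical helper, and no particular result is asserted here.

```elisp
;; Hypothetical helper for illustration; not part of the patch.
(defun my-compare-raw-byte-encoding (coding-system)
  "Encode a raw byte with CODING-SYSTEM from unibyte and multibyte input.
Return a list (UNIBYTE-RESULT MULTIBYTE-RESULT SAME-P)."
  (let* ((unibyte "A\351B")                        ; raw byte #xE9, unibyte string
         (multibyte (string-to-multibyte unibyte)) ; same text, multibyte form
         (from-unibyte (encode-coding-string unibyte coding-system))
         (from-multibyte (encode-coding-string multibyte coding-system)))
    (list from-unibyte from-multibyte
          (equal from-unibyte from-multibyte))))

(my-compare-raw-byte-encoding 'utf-16be)
```

If the claim holds, the third element of the returned list is `t`.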




