unofficial mirror of emacs-devel@gnu.org 
* Use the Unicode replacement character for replacing unencodable characters into UTF-16
@ 2020-08-18 15:36 Mattias Engdegård
  2020-08-18 16:19 ` Eli Zaretskii
  0 siblings, 1 reply; 5+ messages in thread
From: Mattias Engdegård @ 2020-08-18 15:36 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 471 bytes --]

The attached patch ensures that non-Unicode characters are replaced with U+FFFD REPLACEMENT CHARACTER, rather than a space, when encoding to UTF-16. (By all evidence, the space is a historical accident.)

One possible solution to bug#42904 requires this change. We can do without the patch, but it fixes a clear bug in its own right.

For some reason, unpaired surrogates aren't affected despite not being encodable in UTF-16 -- another bug, but not one addressed here.
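
A minimal sketch for checking the behaviour by hand, e.g. in the *scratch* buffer. The helper name `my-utf16-hex' is hypothetical and not part of Emacs or of the attached patch; it only formats the result of `encode-coding-string' as hex so the replacement bytes are easy to spot. The input string is the one used in the patch's new test.

```elisp
(defun my-utf16-hex (string coding)
  "Return the bytes of STRING encoded with CODING as a hex string.
Hypothetical helper for illustration only; not part of Emacs."
  (mapconcat (lambda (byte) (format "%02x" byte))
             (encode-coding-string string coding)
             " "))

;; "\351" is a raw byte, which has no Unicode code point and so
;; cannot be encoded in UTF-16.
(my-utf16-hex "A\351B" 'utf-16be)
;; With the patch applied, the new test expects the bytes
;; 00 41 ff fd 00 42 (U+FFFD in the middle).
;; Without it, per the description above, a space (00 20)
;; takes the place of ff fd.
```

The same call with 'utf-16le should show the byte pairs swapped (fd ff with the patch), matching the little-endian cases in the test.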


[-- Attachment #2: 0001-Use-Unicode-replacement-character-for-unencodable-UT.patch --]
[-- Type: application/octet-stream, Size: 3094 bytes --]

From 28764d55bf06f2b81a33ea03258ba62b9c02a6b9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Tue, 18 Aug 2020 17:00:15 +0200
Subject: [PATCH] Use Unicode replacement character for unencodable UTF-16
 characters

Use the standard U+FFFD REPLACEMENT CHARACTER instead of a space to
replace characters that cannot be encoded in UTF-16.

* lisp/international/mule-conf.el (utf-16le, utf-16be)
(utf-16le-with-signature, utf-16be-with-signature, utf-16):
Use U+FFFD as :default-char.
* test/src/coding-tests.el (coding-utf-16-replacement-char): New test.
---
 lisp/international/mule-conf.el |  5 +++++
 test/src/coding-tests.el        | 12 ++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/lisp/international/mule-conf.el b/lisp/international/mule-conf.el
index edda79ba4e..b9acafc158 100644
--- a/lisp/international/mule-conf.el
+++ b/lisp/international/mule-conf.el
@@ -1336,6 +1336,7 @@ 'utf-16le
   :mnemonic ?U
   :charset-list '(unicode)
   :endian 'little
+  :default-char #xfffd
   :mime-text-unsuitable t
   :mime-charset 'utf-16le)
 
@@ -1345,6 +1346,7 @@ 'utf-16be
   :mnemonic ?U
   :charset-list '(unicode)
   :endian 'big
+  :default-char #xfffd
   :mime-text-unsuitable t
   :mime-charset 'utf-16be)
 
@@ -1355,6 +1357,7 @@ 'utf-16le-with-signature
   :charset-list '(unicode)
   :bom t
   :endian 'little
+  :default-char #xfffd
   :mime-text-unsuitable t
   :mime-charset 'utf-16)
 
@@ -1365,6 +1368,7 @@ 'utf-16be-with-signature
   :charset-list '(unicode)
   :bom t
   :endian 'big
+  :default-char #xfffd
   :mime-text-unsuitable t
   :mime-charset 'utf-16)
 
@@ -1375,6 +1379,7 @@ 'utf-16
   :charset-list '(unicode)
   :bom '(utf-16le-with-signature . utf-16be-with-signature)
   :endian 'big
+  :default-char #xfffd
   :mime-text-unsuitable t
   :mime-charset 'utf-16)
 
diff --git a/test/src/coding-tests.el b/test/src/coding-tests.el
index c438ae22ce..8b0adf0ad8 100644
--- a/test/src/coding-tests.el
+++ b/test/src/coding-tests.el
@@ -429,6 +429,18 @@ coding-check-coding-systems-region
                  '((iso-latin-1 3) (us-ascii 1 3))))
   (should-error (check-coding-systems-region "å" nil '(bad-coding-system))))
 
+(ert-deftest coding-utf-16-replacement-char ()
+  (should (equal (encode-coding-string "A\351B" 'utf-16be)
+                 (unibyte-string 0 ?A #xff #xfd 0 ?B)))
+  (should (equal (encode-coding-string "A\351B" 'utf-16le)
+                 (unibyte-string ?A 0 #xfd #xff ?B 0)))
+  (should (equal (encode-coding-string "A\ud8b6BΣ\227D𝄞" 'utf-16be)
+                 (unibyte-string 0 ?A #xd8 #xb6 0 ?B #x03 #xa3 #xff #xfd 0 ?D
+                                 #xd8 #x34 #xdd #x1e)))
+  (should (equal (encode-coding-string "A\ud8b6BΣ\227D𝄞" 'utf-16le)
+                 (unibyte-string ?A 0 #xb6 #xd8 ?B 0 #xa3 #x03 #xfd #xff ?D 0
+                                 #x34 #xd8 #x1e #xdd))))
+
 ;; Local Variables:
 ;; byte-compile-warnings: (not obsolete)
 ;; End:
-- 
2.21.1 (Apple Git-122.3)


Thread overview: 5+ messages
2020-08-18 15:36 Use the Unicode replacement character for replacing unencodable characters into UTF-16 Mattias Engdegård
2020-08-18 16:19 ` Eli Zaretskii
2020-08-18 17:07   ` Mattias Engdegård
2020-08-18 18:13     ` Eli Zaretskii
2020-08-18 19:43       ` Mattias Engdegård
